Colour Me Red – the Ingect system for Research Data Collections


By Peter Sefton (USQ) & Vicki Picasso (University of Newcastle).

The Australian National Data Service (ANDS) has funded a Metadata Store software development project by the Australian Digital Futures Institute (ADFI) at USQ. For the development ADFI will be working with the University of Newcastle in NSW and Swinburne University of Technology in Melbourne. The Metadata Store project is to develop a software system to store metadata around research data collections. The system doesn’t have a definitive name at this stage; the working titles are:

  • ReDBox – named for the world-famous white-board diagram that the authors published earlier this year, and a backronym for Research Data Box. This is the working title of the system as it will be implemented at the University of Newcastle.
  • Ingect – named from a tweet-typo, and apparently a mixture of Ingest and Inject, which are two ways that descriptions of research data will find their way into the system.
  • EIF-040 – the name/number of ADFI’s ANDS project code.

In this post we want to give the ANDS-partner community an outline of the technical architecture of the system as we see it at this stage, and to briefly describe the model for implementing the infrastructure. It’s early days, and as this is an agile development project some things may change as we go along, but we’ll keep you posted on the story as it unfolds.

What are we building and why?

The basic application we’re building is similar to an institutional repository like Newcastle’s Nova except that instead of the system being full of PDF versions of research publications and theses, it’s going to contain descriptions of research data collections. In the planned installation of ReDBox for Newcastle the application will sit alongside the Nova repository as a management interface, and will push final published descriptions of data collections into the repository. From the repository these will then be harvested by Research Data Australia.

There are three main drivers for this:

  1. ANDS are building the Australian Research Data Commons, which will be accessed via a portal known as Research Data Australia. The idea is to make caches of research data discoverable, wherever possible in an open-access mode for download. Ideally, this means that the description of the data collection should also provide a stable link to the location of the data so that it can be accessed and used by others. If that’s not possible, then the description should provide contact information for the researcher, which should enable negotiated access to the data. The outcome of either method is really about getting the best value out of hard-won data, tackling big problems and small.
  2. Compliance with funder requirements, codes of practice and legislation, in particular The Code.
  3. Research data is a valuable asset that requires management to maximise sharing, re-use and potential collaboration within the research community.

The basic architecture

The basic architecture of the overall repository will consist of two parts. Firstly, there is the existing Newcastle IR, Nova, which will be the place where published descriptions of research data will end up. Secondly, the ReDBox application will sit alongside the IR to handle the ingest and discovery of research collections. ReDBox is being built using The Fascinator, an open source repository toolkit which is a product of the Australian Digital Futures Institute where Peter Sefton works. The Fascinator will use Fedora Commons as a storage layer, giving it enterprise-class repository stability. (Note that the current version is not using Fedora, but we are adding the Fedora back-end as part of the ReDBox project.)

The diagram below shows the existing NOVA Fedora IR (in black) and the current ingest methods available for research publications to make their way into the repository (represented by 1, 2, 3, in blue). Newcastle uses VALET as a staging system to process publications going into the repository. NOVA has three instances of VALET (three queues) that enable different workflows to co-exist but all lead to the IR.

ReDBox (represented by 4, in red) will sit alongside the repository to provide a dedicated management system to ingest and process metadata records for the repository, as well as the additional entities required for RDA such as records for Parties, Activities and Services. More about these further along. We’ve also identified a number of institutional triggers that can alert us to the presence of research data. These triggers will form the basis of an investigation. Institutional triggers will depend on each institution’s environment and processes; the ones described here are those we are pursuing at Newcastle.


Figure 1 : Leveraging off existing IR infrastructure

The above diagram outlines how the implementation of ReDBox will leverage existing infrastructure at Newcastle. The red and green components are what will be incorporated into the new system.

At Newcastle there will be data librarians who will curate the records in ReDBox and will be the ones who manage the process of pushing data from the management environment over the curation boundary into the repository.

What’s in the Box?

One important architecture choice hinges on what to store in ReDBox and what will live elsewhere. The ANDS model for their research data registry has four classes of item, based on the ISO 2146 standard for registries. These are:

  1. Collections; the main point of this work. How granular collection descriptions should be, and what data we need to describe them in a way that minimises ambiguity and duplication, will remain an ongoing question, but this is undoubtedly the heart of the project. See our recent post on the CAIRSS blog about metadata issues.
  2. Parties (people and organizations). There are lots of places in a university where people are described and records about them are managed, in systems run by HR, IT and the research office, not to mention forthcoming services from the National Library of Australia to manage identifiers for all Australian Researchers, via the ARDC-PIP project extending People Australia. What this means for the current project is that we don’t want to build yet another people-based system, yet we do need to sort out unambiguous identities for (at least local) researchers and for organisational units. While we do plan on managing and storing names (referred to as Parties), these will not be stored in the main ReDBox repository; instead there will be an institutional service for managing name identities along with vocabulary terms like subject codes (more below). This will work as a lookup service from ReDBox.
  3. Activities; mainly research projects for our purposes. Again, activities often have centrally managed authoritative data sources about them. For example, the ARC has records about its projects, and we hope that one day it will be able to put them on the web as linked-data. Locally, each institution has data about activities as well, so as with parties we’re not looking to reproduce that. The solution is instead to harvest data about grants, research projects and so on, and create proxy records for them in an authority service which can do the linked-data thing.
  4. And there are services. This is the most under-developed and probably least understood part of the ANDS information model. It would include things like repositories and data stores, we think.
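To make the split concrete, here is a minimal sketch of a collection description that is stored in full while parties and activities are only referenced by URI plus a human-readable string. The class names, field names and example.org URIs are all invented for illustration; they are not taken from ReDBox or The Fascinator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Reference:
    """Pointer to an entity managed elsewhere: a URI plus a display string."""
    uri: str
    label: str


@dataclass
class CollectionDescription:
    """The one kind of entity described in full inside the box (hypothetical shape)."""
    title: str
    description: str
    parties: List[Reference] = field(default_factory=list)
    activities: List[Reference] = field(default_factory=list)

    def referenced_uris(self) -> List[str]:
        """All external URIs this record depends on, parties first."""
        return [ref.uri for ref in self.parties + self.activities]


# Hypothetical record; the URIs would come from an authority service.
record = CollectionDescription(
    title="Example survey data",
    description="Hypothetical collection used only for illustration.",
    parties=[Reference("http://example.org/party/1", "Dr A. Researcher")],
    activities=[Reference("http://example.org/activity/42", "Example grant")],
)
```

The point of the design is that the collection record never copies party or activity data; it only carries enough to display a name and to follow a link.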

Institutional Triggers

In addition to the basic web-interface (forms and queues) which will be provided through ReDBox there are also institutional triggers (Vicki’s term) which will assist data librarians in creating and managing records about data collections.

  1. Grants database. At Newcastle this is InfoEd. The idea is to set up an alert from the Grants database (e.g. new grant record has been created or a grant record has been closed) which will flow into ReDBox to flag that we need to investigate a potential research data collection. The alert could take the form of an Atom feed, or could generate a spreadsheet, or a brief record.
  2. Research Storage. The idea is to have a watcher service on our research storage utility to flag that new data has been moved into the designated storage area that may potentially be the signal that a new research data collection is available to be described and/or shared to the RDA.
  3. Self-identify. Web based interface that allows a researcher to self-identify data for publishing at any point in time. In addition, this will also be the tool for data librarians to create new records for data collections if they need to.
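As a rough sketch of the first trigger, assuming the grants database can emit an Atom feed of alerts, something like the following could turn new feed entries into "investigate this" tasks for a data librarian. The feed content, the `urn:example:` identifiers, the task fields and the function name are all hypothetical; only the Atom namespace is standard.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # standard Atom namespace

# Hypothetical alert feed; a real one would be fetched from the grants system.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Grant alerts (hypothetical)</title>
  <entry>
    <id>urn:example:grant:1001</id>
    <title>New grant record created</title>
    <updated>2010-05-01T00:00:00Z</updated>
  </entry>
  <entry>
    <id>urn:example:grant:0987</id>
    <title>Grant record closed</title>
    <updated>2010-05-02T00:00:00Z</updated>
  </entry>
</feed>
"""


def alerts_to_tasks(feed_xml, seen_ids):
    """Return one task per feed entry whose id has not been seen before."""
    tasks = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if entry_id in seen_ids:
            continue  # already flagged on an earlier run
        tasks.append({
            "source": entry_id,
            "reason": entry.findtext(ATOM + "title"),
            "status": "to-investigate",
        })
    return tasks


# Pretend the closed-grant alert was already processed on a previous run.
tasks = alerts_to_tasks(SAMPLE_FEED, seen_ids={"urn:example:grant:0987"})
```

Tracking the ids already seen keeps the same alert from being queued twice when the feed is polled repeatedly.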


Figure 2 : Institutional triggers in relation to ReDBox

At Swinburne, where Research Master is the grants database, some configuration and adaptation of the Newcastle model will be necessary; one of our goals on this project is to make sure that we have the right configuration hooks exposed so this is as easy, cheap and risk-free as possible. It definitely won’t require changing the core application.

The ReDBox application is designed to be usable at any institution.

Authority services

There is one more service in this picture, the Linked Authority Control Service (LACS). This is the piece of the service puzzle which will allow us to produce linked-data ready collection descriptions. There is a more detailed report coming on progress so far with this component, which Debbie Campbell (NLA) suggests could be dubbed The Mint, but here is a quick summary of its functions:

  1. Act as an authority file, providing URI-name-lookup, using data that has been gleaned from other sources for:
    • Parties; people and organizational units.
    • Vocabularies, such as subject codes.
    • Ontologies, such as the ones being developed by the ANDS-VITRO consortium.
  2. Provide web-services for web forms, name-lookups, and taxonomy/vocabulary pick-lists. If a user filling out a form starts typing a name, the form will talk to the LACS and get back a list of potential matches. We’re including a screen-shot mock-up here from the NicNames project which shows the kind of thing we’re talking about.
  3. Where national or international infrastructure is yet to come on stream, provide linked-data endpoints for the above; with a view to handing this over to more sustainable owners ASAP.
  4. Maybe (and this is definitely a maybe) provide some link-checking services to make sure that any HTTP URIs used to identify things are resolving properly (the problem here is what constitutes sensible resolution, and how to deal with services which are only offline for a short time).
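The name-lookup behaviour in item 2 can be sketched as a simple prefix match over an authority file that maps name variants to a canonical name and a URI. This is only an illustration of the idea, not the LACS interface: the authority data, the URIs and the `lookup` function are all made up.

```python
# Hypothetical authority file: URI -> canonical name plus known variants.
AUTHORITY = {
    "http://example.org/party/1": {
        "canonical": "Smith, Jane",
        "variants": ["Jane Smith", "J. Smith", "Smith, J."],
    },
    "http://example.org/party/2": {
        "canonical": "Smithers, John",
        "variants": ["John Smithers"],
    },
    "http://example.org/party/3": {
        "canonical": "Nguyen, Anh",
        "variants": ["Anh Nguyen"],
    },
}


def lookup(prefix, authority=AUTHORITY, limit=10):
    """Return (uri, canonical name) pairs where the canonical name or any
    variant starts with the typed text, ignoring case."""
    needle = prefix.strip().lower()
    if not needle:
        return []  # nothing typed yet, so offer no suggestions
    matches = []
    for uri, entry in authority.items():
        names = [entry["canonical"]] + entry["variants"]
        if any(name.lower().startswith(needle) for name in names):
            matches.append((uri, entry["canonical"]))
    # Sort by canonical name so the pick-list is stable between calls.
    return sorted(matches, key=lambda match: match[1])[:limit]
```

Whatever variant the user types, the form gets back the canonical name and its URI, which is what lets the resulting collection description be linked-data ready.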

Summary of data distribution

This may change as the project proceeds, but our starting point is the assumption that the core job of ReDBox is to worry about data collections, and that the other main entities, parties and activities, are the concern of other systems.

| ISO-2146 | Institutional Systems | ReDBox + IR | LACS – The Mint | RDA | Other |
| --- | --- | --- | --- | --- | --- |
| Activities | Grants database | References via URI + human readable string | Proxy record | Registry entry | Eventually ARC and other funders should provide services |
| Parties | Grants database | References via URI + human readable string | Name-matching records mapping variants to a canonical name | Registry entry | People Australia (in development) |
| Services | ? | ? | ? | Registry entry | ? |
| Collections | Not available | Full description | none | Registry entry | - |

Colour Me Red – Options for deployment

There are a few different ways that ReDBox could be installed, with or without tight integration with an IR. It’s intended to be a flexible model to enable wide application. The model we’re assuming for this project with Newcastle is the first, for Swinburne it is the second:

  1. ReDBox will manage data collection records up until the time they are published via push to Fedora, the repository layer underneath Nova, the Institutional Repository. Records that need to be edited can be pulled back into ReDBox. This is as shown in Figure 2.
  2. Completed records are harvested from ReDBox to a discovery layer/portal via OAI-PMH harvesting of Dublin Core and/or other data that will help the discovery layer to build a usable browse-index of ReDBox content. At Swinburne they are planning to use their Primo system as a discovery layer. A potential alternative at Newcastle is to treat VITAL as a discovery layer, so for collection descriptions it would function as a publishing mechanism but would not contain the record of record.
  3. ReDBox could be used stand-alone using the built-in portal in The Fascinator (currently considered out of scope for this project, but it would be a relatively small project; let us know if you are interested and we can talk to ANDS about funding :) ).
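For readers unfamiliar with the harvesting step in option 2, here is a small sketch of what a harvester does with OAI-PMH and Dublin Core: build a `ListRecords` request and pull a few fields out of the response. The verb, the `oai_dc` metadata prefix and the XML namespaces are standard OAI-PMH; the base URL, the sample response and the function names are invented, and nothing here reflects how ReDBox or Primo actually implement it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for a repository's OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hypothetical, heavily trimmed response for one collection description.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:collection-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example data collection</dc:title>
          <dc:creator>Smith, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(response_xml):
    """Extract identifier, title and creators from each harvested record."""
    records = []
    for rec in ET.fromstring(response_xml).iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "creators": [c.text for c in rec.iterfind(".//dc:creator", NS)],
        })
    return records


harvested = parse_records(SAMPLE_RESPONSE)
```

A real harvester would also need to follow resumption tokens and handle deleted records, which this sketch leaves out.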

In summary, the models are either to publish the research data collection records into the IR and harvest to Research Data Australia via the repository, or to maintain the records within ReDBox and harvest them to a discovery layer (Primo, etc.) and on to Research Data Australia.

Other important attributes of the system

Teula Morgan from Swinburne has reminded us that one of the key requirements for any new class of system like this is that batch-editing needs to be possible, and easy. She reminds us that in the IR experience, we learned lessons along the way about what things should be called, and discovered better ways to describe things, and then in most cases found that it was not possible to batch-edit repository items. This was possibly partly due to the notion that a repository is a preservation system, and that curator-mediated ingest workflows would make sure the repository was ‘clean’, but really we do need to be able to edit things and to clean data. The Fascinator has some facilities already that can be used for batch-editing, but we will make sure that we try to keep the repository as clean and up to date as possible as we go, and build batch editing tools that will allow repository owners to do so into the future without too much programming required.

We have already run into our first cleanup job in the project: adding a marc: prefix to some metadata records in NOVA so we can harvest the data therein. More on that in a future post.
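To give a feel for what a small batch edit of this kind looks like, here is a toy sketch that adds a prefix to field names across a set of records. It assumes, purely for illustration, that records are plain dictionaries keyed by field name; the actual NOVA cleanup is not described here and may well work on XML rather than on anything like this.

```python
def add_prefix(records, prefix="marc:"):
    """Return copies of the records with the prefix added to every field
    name that does not already carry it. The input is left untouched."""
    cleaned = []
    for record in records:
        cleaned.append({
            (key if key.startswith(prefix) else prefix + key): value
            for key, value in record.items()
        })
    return cleaned


# Hypothetical records: one unprefixed, one already partly fixed.
before = [
    {"245": "Example title", "100": "Smith, Jane"},
    {"marc:245": "Another title", "650": "Example subject"},
]
after = add_prefix(before)
```

Because already-prefixed fields are skipped, the edit can be run again safely over a repository that has been partly cleaned.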

Sustainability

As a final note, it’s probably worth saying that the ReDBox model is built on the idea of providing a mediated support service, one where staff supporting the system will be able to interact with the interface for the purpose of workflows, reporting, and look-up services. At Newcastle the model will be that research data librarians will investigate the alerts coming into the system and follow up with data interviews to tell the story of the data collection. During the project we will be employing a Research Data Librarian to work on these things; however, the long-term sustainable model is for the Faculty Librarians to support this service into the future, and for the library to be involved in the support of data management.

Another important dimension of sustainability is building a community of institutions using the same software. We’re eager to talk to institutions in Australia and beyond about how ReDBox / Ingect / EIF-040 might work for them and what levels of support they would need.

Copyright Peter Sefton and Vicki Picasso, 2010. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>
