As various teams start to build research metadata stores1, it’s important to look back at the work done and lessons learnt in the Institutional Repository world. In this post I’ll discuss the issue of identifying “things” so that all repositories can have a shared understanding of data. It’s something that wasn’t well resolved in the IR community and threatens to reappear in our research metadata stores.
The many phases of IRs
If I can be so bold as to classify the phases of IR implementation we see in Australia, I might proffer the following:
- Phase 1 – Making publications available
- This first phase saw Australian Universities adopt an Institutional Repository for the storage and, generally, access to text-centric research outputs. These were primarily focussed at the institutional level but a harvesting interface was provided for services such as the NLA’s Australian Research Online (http://research.nla.gov.au/) system.
- Phase 2 – Integrating into the University data ecosystem.
- Maybe due to feedback from researchers being annoyed at having yet another data entry system or from other motivations, IR managers sought ways in which the IR could be a part of the University Research Management “ecosystem”. Examples include the provision of a pre-filled HERDC form or data exchange between the IR and the research management system.
- Phase 3 – Joining the Linked Data world
- In this phase, the IR can become more of a “global citizen” and share data. This is more than your OAI-PMH interface: this phase requires you to correctly (and universally) identify important pieces of data. The Linked Data effort uses URIs for this but the main idea is that if we identify something with the same ID then we mean the same thing.
- Phase 4 – Semantic systems
- This allows the software to mine the data to find all types of interesting stuff. Our IRs present semantically useful information about their holdings and the holdings themselves are rich in semantic information. The system can then say “Hey, you’re a bridge engineer that works mainly in Spain, have you noticed the work being done on the relationship between bridges and river gnats in largely Hispanic regions? No? Then I have something for you”.
I avoid the use of the word “generation” here as each phase does not necessarily come in a specific order.
I would suggest that Phase 1 existed in the Australian Digital Thesis project and the array of IRs built under ASHER. Phase 2 requires decision-making at an institutional level so it’s not an easy goal. CAIRSS data would indicate a very small number of IRs are integrated to some extent but full data sharing is a rare beast. Some IR managers can’t even get a meeting with the required stakeholders so Phase 2 is looking shaky for them.
Phase 3 is a little way off. Many of the IRs don’t really “do” Linked Data yet. I’ve been trying to work out a way that, using RDF exports over OAI-PMH, we could expose some elements as Linked Data. It’d need some data massaging but would put an IR in the Phase 3 basket without a major re-engineer. For data registries, however, we could aim to be “born Phase 3” and properly identify subject codes, people, organisations (and more) from the get-go.
There’s a lot of talk about Phase 4 and it’ll be cool to see it working. I also thought it’d be cool to have an intelligent robot hover car but it seems we’re a way off… I also fear the day that my library catalogue says “I have determined that your research is inferior to all of your peers and you should focus on your ability to dig holes.”…
So what? Hasn’t the IR problem been solved? Well, no. There are still questions about how to handle non-text-centric outputs (Arts repositories, as they’re generally named), how to distinguish author names and, for some, how to get the software to work2. It’s been known for some time that the design decisions made for most IRs were a little on the inward-looking side. Whilst we all know it’s a journal article, we all called it something different. We also took subject codes and presented them in a different manner3.
Our work on data registries faces very similar problems to those that exist in IRs and my concern is that, without being careful, we’ll wade into the same river and start at Phase 1. I hope to present to you a basic piece of architecture that can be used to correctly identify the ANZSRC4 subject codes so that IRs and data registries can correctly identify the Field of Research (FOR) and Socio-economic Objective (SEO) codes. I’ll also cover some other areas that lend themselves to identifiers – all in the hope that we can aim our data registries at Phase 3.
Linked Data
Like so many overviews of Linked Data, I’ll quote Tim Berners-Lee:
Like the web of hypertext, the web of data is constructed with documents on the web. However, unlike the web of hypertext, where links are relationships anchors in hypertext documents written in HTML, for data the links between arbitrary things described by RDF,. The URIs identify any kind of object or concept. But for HTML or RDF, the same expectations apply to make the web grow:
Use URIs as names for things
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
Include links to other URIs. so that they can discover more things.
Use URIs as names for things
Linked Data utilises URIs (Uniform Resource Identifiers, originally named Universal Resource Identifiers) to uniquely identify something. You’ve seen URIs because URLs (Uniform Resource Locators) are a class of URI. So http://ands.org.au is a URL which also means it’s a URI. Uniform Resource Names (URNs) are also URIs but they are used to name something rather than locate it.
What? Well the biography of Isambard Kingdom Brunel can have a URN of urn:isbn:0140117520 and the URL http://www.amazon.com/Isambard-Kingdom-Brunel-L-T-C-Rolt/dp/0140117520 can take you to information about the book. The URN should be universally unique as it’s an ISBN but the URN just names a thing – it doesn’t take you to information about it.
The key idea in a URI is universality. Unlike a relational database, which contains identity fields (e.g. a primary key) and relies on uniqueness existing only within its own system, data that seeks to link across the internet needs to use a global primary key.
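To make the name-versus-locate distinction concrete, here is a minimal Python sketch that inspects the scheme of the two identifiers used above. The function name `describe_uri` and the wording of its labels are illustrative assumptions, not part of any standard.

```python
from urllib.parse import urlparse


def describe_uri(uri: str) -> dict:
    """Report the scheme of a URI and whether it names or locates a thing."""
    parts = urlparse(uri)
    scheme = parts.scheme.lower()
    if scheme == "urn":
        kind = "URN (names a thing)"
    elif scheme in ("http", "https"):
        kind = "URL (locates a thing on the web)"
    else:
        kind = "other URI"
    # For a URN there is no network location, so everything after
    # the scheme ends up in the path component.
    return {"scheme": scheme, "kind": kind, "rest": parts.netloc + parts.path}


book_name = describe_uri("urn:isbn:0140117520")
book_page = describe_uri(
    "http://www.amazon.com/Isambard-Kingdom-Brunel-L-T-C-Rolt/dp/0140117520"
)
print(book_name["kind"])
print(book_page["kind"])
```

Both strings are URIs; only the second one tells a client where to go to fetch something.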
Use HTTP URIs so that people can look up those names
A URI does not have to be a “web thing” – it doesn’t have to be something your Firefox browser can open. So the URN can name something but it’s always handy to find information about it. Basically, the expectation is that everything being described should have a home on the web. For example, I can be identified as http://duncan.dickinson.name/card. If you browse to that URL you’ll find my rarely updated home page – obviously you’re not looking at the real, physical me, just something that identifies me on the web.
However, a Linked Data browser will see http://duncan.dickinson.name/card.rdf. Using a technology named Content Negotiation I can send information in a format that suits my “reader” from the same URL. It’s handy to give a human something that they understand so it’s always worth providing an HTML page for any URI you’re creating.
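Content negotiation can be sketched roughly as follows: the client states what it prefers in the HTTP `Accept` header and the server picks the best representation it has. This is a simplified illustration only. The `AVAILABLE` table and the file name `card.html` are assumptions made for the example (only `card.rdf` appears above), and a real web server handles more cases, such as wildcards like `text/*`.

```python
def parse_accept(header: str) -> list:
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for item in header.split(","):
        fields = [f.strip() for f in item.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the client gives no weight
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so types with equal q keep their header order
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


# Hypothetical representations the server holds for one URI.
AVAILABLE = {"text/html": "card.html", "application/rdf+xml": "card.rdf"}


def choose_representation(accept: str, default: str = "text/html") -> str:
    """Pick the representation that best matches the client's Accept header."""
    for media_type, q in parse_accept(accept):
        if q <= 0:
            continue  # q=0 means "never send me this"
        if media_type in AVAILABLE:
            return AVAILABLE[media_type]
        if media_type == "*/*":
            return AVAILABLE[default]
    return AVAILABLE[default]


# A web browser typically asks for HTML first; a Linked Data client asks for RDF.
print(choose_representation("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
print(choose_representation("application/rdf+xml"))
```

Falling back to HTML when nothing matches keeps the URI useful to a human who wanders in with an ordinary browser.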
When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
RDF (Resource Description Framework) is a W3C specification for describing data in a machine-readable manner. It’s not really something you or I sit back and read over toast like we do with all those HTML pages. RDFa is a method for adding RDF to HTML pages that lets us merge the human readability of a normal web page with markup that can be understood by computers. SPARQL is a method for querying the data but I won’t get into that in this post.
As mentioned before, provide output based on who’s reading your information.
Include links to other URIs. so that they can discover more things.
This creates the web of data – if you’re referring to something or someone, you should use their URI. For example, Name = ‘Duncan Dickinson’ is easy enough to understand for a human but a linked data browser that’s given Name = http://duncan.dickinson.name/card can follow that link to find out more information and even ask other systems what they know about me. Just like the human web, hyperlinked data gives us much more than just an isolated info-chunk.
The ANZSRC codes
Let’s return to the ANZSRC codes – they’re ripe, low-hanging fruit for our move to Linked Data. Most IRs use these codes but we use them in different ways and this makes it tricky to bring all of the metadata together. Our goal should be to identify ANZSRC codes at a national level in a manner that tells us that we’re all talking about the same thing.
One approach would be to define the various ANZSRC codes with the W3C’s Simple Knowledge Organization System (SKOS). SKOS is a standard for defining things such as subject heading lists and vocabularies: it lets you define a set of concepts and provide pointers to broader, narrower and related concepts. Concepts can be arranged under concept schemes to help present them as a compiled group.
This looked to me to fit the bill for describing subject codes. In fact, this is what the Library of Congress has done for its subject codes and you can find these at http://id.loc.gov/. Each subject is defined and given its own URI that systems can point to and say “that’s what I’m talkin’ ’bout”.
As we are trying to be a Linked Data shop and we needed URIs for the ANZSRC codes I sat down and bashed out SKOS definitions for the Field of Research (FOR), Socio-economic Objective (SEO) and Type of Activity (TOA) codes. The end result looks something like these:
As a person using a web browser you should see a rather huge HTML page with the various codes and some gobbledygook under each code. If you go to http://purl.org/anzsrc/for#group_0501 you could see information such as5:
- It has a preferred label “050100 – ECOLOGICAL APPLICATIONS”
- It is in the FOR concept scheme
- It has a broader concept in the 05 Division
- There are narrower concepts such as 050101
It’s not pretty but it’s a start. It also isn’t what a Linked Data browser would see. The web server hosting the SKOS will provide different formats of the file based on the sort of client being used. So, if you were to go to http://dataviewer.zitgist.com/ and type in http://purl.org/anzsrc/for you’d be pretending to be a Linked Data browser and would get RDF/XML back.
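For a feel of what sits behind such a page, the Python sketch below writes the 0501 group out as SKOS in Turtle syntax using only string formatting. It is not the actual file served at the PURL: the fragment names `division_05` and `field_050101`, and the use of http://purl.org/anzsrc/for as the concept scheme URI, are assumptions made for illustration. Only `group_0501` and the SKOS property names come from real sources.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
SCHEME = "http://purl.org/anzsrc/for"  # assumed concept scheme URI
FOR = SCHEME + "#"


def concept_to_turtle(local_id, label, broader=None, narrower=()):
    """Serialise one SKOS concept as a Turtle statement block."""
    # Escape backslashes and quotes so the label is a valid Turtle string.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    predicates = [
        f"a <{SKOS}Concept>",
        f'<{SKOS}prefLabel> "{safe_label}"@en',
        f"<{SKOS}inScheme> <{SCHEME}>",
    ]
    if broader:
        predicates.append(f"<{SKOS}broader> <{FOR}{broader}>")
    for child in narrower:
        predicates.append(f"<{SKOS}narrower> <{FOR}{child}>")
    return f"<{FOR}{local_id}>\n    " + " ;\n    ".join(predicates) + " .\n"


turtle = concept_to_turtle(
    "group_0501",
    "050100 – ECOLOGICAL APPLICATIONS",
    broader="division_05",        # hypothetical fragment name
    narrower=["field_050101"],    # hypothetical fragment name
)
print(turtle)
```

The point is that every code, and every relationship between codes, is itself a URI that any other system can point at.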
You may have also noticed that we’re using PURLs. These just map a URL to one we host at USQ but, in the long (and sensible) term, these SKOSes (or a better, community agreed upon version) could be hosted by a more central service.
Hoot Hoot?
What about the Web Ontology Language (OWL)? Isn’t OWL used by systems like Vivo? Yes, it is. SKOS concepts are OWL individuals6. You could also define the ANZSRC differently in OWL if you wanted. But this is the point I’m getting to – if we have a variety of definitions floating around, we’re pretty much stuck in the Phase 1 issue where each institution knows what it’s talking about but it all falls apart when we try to bring the data together.
Those wise OWL people out there will tell me that OWL has a construct called sameAs which is a way to indicate that my definition of “010000 – Mathematical Sciences” is the same as the one defined by you in another place. So we can all define our own versions of the ANZSRC codes and then point at everyone else’s definition and say “sameAs”. But, having done the copying and pasting to construct the data myself, I can tell you it is very boring and takes quite some time to just get the basics in7. From my Eprints experience I know that everyone just sent around the ANZSRC subject code file and uploaded it into their system8. Furthermore, with OWL we’re starting to wander away from Linked Data and into the Semantic Web Forest – more on that at the end.
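To show what an aggregator has to do when everyone mints their own codes and links them with sameAs, here is a small Python sketch that merges sameAs statements into groups of equivalent URIs using a union-find structure. All three URIs are hypothetical, and this only illustrates the bookkeeping involved; it is not an OWL reasoner.

```python
def same_as_groups(statements):
    """Group URIs linked by (a, b) sameAs pairs into equivalence sets."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in statements:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


# Hypothetical URIs: two institutions and one shared definition of the same code.
statements = [
    ("http://repo.example.edu/subjects/010000", "http://purl.example.org/for#01"),
    ("http://data.example.ac/anzsrc/for-01", "http://purl.example.org/for#01"),
]
groups = same_as_groups(statements)
print(groups)
```

With n separate definitions of a code, at least n-1 sameAs statements are needed before they all resolve to one thing, and that is per code. A single shared URI needs none.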
So now is the time to get the idea of a central set of identifiers for the ANZSRC codes going BEFORE we all realise that we’re in Phase 1 again.
Resource Types
This is a bit of an issue for IRs at present. I’ll use my own institution’s Eprints system as an example here. If you were to do an OAI-PMH harvest you would find that our journal articles are actually called “Article (Commonwealth Reporting Category C)”. This may make some sense to Australians but almost no sense to anyone that doesn’t do HERDC stuff. Furthermore, if you dared to aggregate it with another IR’s data that called journal articles something else, the computer gives up and tells you they’re different things. The NLA is currently tidying our data for ARO so that IRs like USQ’s that don’t follow the ARO guide are still represented in searches. Lucky for us that the NLA is so nice.
In the Linked Data world we could use the Bibliontology’s definitions of bibliographic types to identify what we are talking about. So you can call it an “Article (Commonwealth Reporting Category C)” but if you use http://purl.org/ontology/bibo/AcademicArticle to identify it, the aggregating system will know it’s the same as another IR’s “Journal Article”.
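A minimal Python sketch of that idea: each repository keeps its local label for display but also records the shared type URI, and the aggregator groups on the URI. The record titles and the mapping table are made up for the example; only the Bibliontology URI and the two labels come from the discussion above.

```python
BIBO_ACADEMIC_ARTICLE = "http://purl.org/ontology/bibo/AcademicArticle"

# Each IR's local label mapped to the shared identifier (illustrative table).
LOCAL_TYPE_TO_URI = {
    "Article (Commonwealth Reporting Category C)": BIBO_ACADEMIC_ARTICLE,
    "Journal Article": BIBO_ACADEMIC_ARTICLE,
}


def group_by_type(records):
    """Group harvested record titles by type URI rather than by local label."""
    grouped = {}
    for record in records:
        # Unknown labels are kept as-is so nothing is silently dropped.
        type_uri = LOCAL_TYPE_TO_URI.get(record["type"], record["type"])
        grouped.setdefault(type_uri, []).append(record["title"])
    return grouped


harvested = [
    {"title": "Paper from IR one", "type": "Article (Commonwealth Reporting Category C)"},
    {"title": "Paper from IR two", "type": "Journal Article"},
]
by_type = group_by_type(harvested)
print(by_type)
```

Better still, each IR would publish the URI itself so the aggregator needs no mapping table at all.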
Does anyone have a URI for “Data Collection”? Well, Vivo does – http://vivoweb.org/ontology/core#Dataset. So maybe we can use that?
You could say that the RIF-CS harvest only goes to the Research Data Australia system so anything there is implicitly a research dataset. But this ignores the fact that, under OAI-PMH you still need to provide your metadata in Dublin Core so any OAI-PMH harvester could pick it up. Furthermore, we need to consider this in the broader, Phase 3 direction. With proper identification we can ask for “All of the journal articles and datasets under the subject heading 010000 – Mathematical Sciences”9 and get a real answer back.
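As a sketch of what such a request could look like once everything is identified by URI, the Python below filters a handful of subject-predicate-object triples. The item URIs and the subject code URI are hypothetical; the `rdf:type` and `dcterms:subject` predicates and the two type URIs are real vocabulary terms. A real system would more likely run a SPARQL query over a triple store.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"
ARTICLE = "http://purl.org/ontology/bibo/AcademicArticle"
DATASET = "http://vivoweb.org/ontology/core#Dataset"
MATHS = "http://purl.example.org/for#division_01"  # hypothetical code URI


def find_items(triples, wanted_types, subject_uri):
    """Return items that have one of the wanted types and the given subject."""
    typed = {s for s, p, o in triples if p == RDF_TYPE and o in wanted_types}
    about = {s for s, p, o in triples if p == DCTERMS_SUBJECT and o == subject_uri}
    return sorted(typed & about)


# Hypothetical items harvested from two different systems.
triples = [
    ("http://ir.example.edu/item/1", RDF_TYPE, ARTICLE),
    ("http://ir.example.edu/item/1", DCTERMS_SUBJECT, MATHS),
    ("http://data.example.edu/set/7", RDF_TYPE, DATASET),
    ("http://data.example.edu/set/7", DCTERMS_SUBJECT, MATHS),
    ("http://ir.example.edu/item/2", RDF_TYPE, ARTICLE),
]
matches = find_items(triples, {ARTICLE, DATASET}, MATHS)
print(matches)
```

No label matching or per-repository clean-up is needed because both systems used the same identifiers for type and subject.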
People
I haven’t touched on the tricky nature of identifying people because it’s another (much longer) story. However, I will suggest that we should think carefully about the identifiers that are out there and how flexible our local implementations will be. The NLA is working on this at a national level and I think we’re likely to see local naming authorities feeding into that.
According to People Australia, Keith Miller has a URI of http://nla.gov.au/nla.party-506232 but he may also have an ACA identifier and a military service record number10. Our researchers will have a few identifiers at the moment – staff numbers, Thomson Reuters ResearcherID etc.
The trick will be to use a stable URI that allows for the resolution of their other IDs. What’s important to remember is that, locally, some IDs may not be for public consumption (Staff ID is a good example) but are useful for accessing internal systems such as the expensive HR system they just bought. So I think that a local identification manager is likely to be needed.
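One way to picture a local identification manager is a small lookup service that ties a stable URI to a person’s other identifiers and knows which of them may be shown publicly. The Python sketch below is only an illustration of that idea: the class names, the scheme names and every identifier value are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PersonRecord:
    uri: str                                        # stable, public URI
    public_ids: dict = field(default_factory=dict)  # safe to publish
    private_ids: dict = field(default_factory=dict) # internal use only


class IdentifierManager:
    """Resolve any known identifier to a stable URI, hiding private IDs."""

    def __init__(self):
        self._records = {}
        self._lookup = {}  # (scheme, value) -> stable URI

    def register(self, uri, public_ids=None, private_ids=None):
        record = PersonRecord(uri, dict(public_ids or {}), dict(private_ids or {}))
        self._records[uri] = record
        for scheme, value in {**record.public_ids, **record.private_ids}.items():
            self._lookup[(scheme, value)] = uri
        return record

    def resolve(self, scheme, value):
        """Internal lookup: works for public and private identifiers alike."""
        return self._lookup.get((scheme, value))

    def public_view(self, uri):
        """What an outside harvester is allowed to see."""
        record = self._records[uri]
        return {"uri": record.uri, "identifiers": dict(record.public_ids)}


manager = IdentifierManager()
manager.register(
    "http://id.example.edu/person/42",              # hypothetical URI
    public_ids={"researcher_id": "X-0000-0000"},    # made-up value
    private_ids={"staff_id": "S12345"},             # made-up value
)
print(manager.resolve("staff_id", "S12345"))
print(manager.public_view("http://id.example.edu/person/42"))
```

The HR system can still find a person by staff number, while the outside world only ever sees the stable URI and the identifiers marked as public.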
Conclusion
Some readers may note that I haven’t brought up The Semantic Web to any great degree. Well, Linked Data is not the Semantic Web (SW) but it provides a foothold for the SW. If we use URIs to identify things11 then the Semantic Web Wizards out there can start mining our data and that could turn up some interesting stuff. From a Linked Data approach, I think it would just be good to say “Give me all journal articles and data relating to 010000 – Mathematical Sciences” and get a useful set of data that didn’t need to be carefully converted at aggregation time.
I’d like to leave with one major call for action: we need to sort out identifiers for the ANZSRC codes and for resource types (journal articles, datasets, etc). We need to do it soon so that our various data registries know what they’re talkin’ about.
Self-promotion alert: On Phase 3, see the position paper I submitted for Link Affiliates to the ADL Registries & Repositories Summit (exec summary: Don’t Be Part Of The Dark Web).
You don’t need to do the Semantic Web stuff yourself. But as a data custodian, you shouldn’t be getting in the way of others doing Semantic Web with your data. That’s a problem with the way St Tim has been advocating the Semantic Web: the insistence on RDF should not get in the way of the URL infrastructure for the rest of Linked Data—the “foothold” that Duncan refers to at the end.
IRs however are often still at the stage of not even being googleable, let alone being Linked Data compatible. That’s clearly untenable: IRs have to be citizens of the Web, and not Data enclaves. It means, for one, that IRs have to adopt Google Sitemaps, since that’s how Google will explore your IR: they pulled the plug on OAI-PMH support two years ago.
This is a very engaging post that got me thinking about how repositories have evolved and what they might evolve into in the future. Providing URIs for things like FoR codes and resource types would be brilliant. Perhaps do the same for RFCD codes which may be present in some earlier repository records. It’s also an aggregator’s dream as it would dramatically reduce the need for data transformation and greatly improve resource discovery.
The ANDS funded ARDC Party Infrastructure Project (see https://wiki.nla.gov.au/display/ARDCPIP/ ) at the National Library provides an opportunity for IR Managers to include linked data about authors (researchers) in their IRs. QUT is to be congratulated in becoming the first early implementer of the party infrastructure. They provided a sample number of party records to the NLA from the QUT IR and this has allowed the project team to test and demonstrate how the party infrastructure works. These records can be viewed in Trove at http://trove.nla.gov.au/people/result?q=au+qut+gp
We have yet to see the first implementer who will pull back the NLA party IDs (persistent identifiers) and display them in their IR but when this happens it will be another small step toward your description of phase 3 for IRs.
The issue of consistent terminology is a really hard nut to crack. MACAR (Metadata Advisory Committee for Australian Repositories) had a go at a vocabulary for resource types, and many of the repositories followed that vocabulary. It included both dataset and datastudy. The next logical step would have been to provide a URI for it, but time was up.