This third and final post about the ANDS Queensland SWIG (Semantic Web Interest Group) meeting summarises our discussion of best practices for semantic web representation of research data and research metadata. Previous posts catalogued the projects represented at the meeting and documented issues about publishing common vocabularies.
Representing research outputs
While most of the SWIG discussion focussed on semantic web representation of metadata about research entities (metadata about data collections, projects, researchers etc), the group also discussed approaches to representing actual research outputs (such as datasets and annotations) using semantic web technologies. Projects discussed included:
- The Aus-e-lit project has implemented a compound object authoring and publishing service based on the OAI-ORE semantic web model. Jane Hunter from their team discussed the advantages of using semantic web technologies to represent research data. She suggested that semantic web approaches can succinctly model relationships between entities, such as the relationship between a “cleaned” dataset and the raw dataset it derives from. Semantic web representations also make it easier to create visualisations of these high level relationships and present them to end users.
- The Aus-e-lit researchers have recently joined the Open Annotation Collaboration project. Social science, humanities, and crystallography communities create annotations in the course of their research, but find it difficult to share annotations between software systems. The Open Annotation Collaboration project aims to use semantic web approaches to move annotations across the boundaries of annotation clients, annotation servers, and content collections.
- The ADFI team at USQ have contributed to the Beyond the PDF project, examining semantic web approaches for linking papers to datasets and other related information.
- The Health-e-reef project used a high level ontology to unify coral reef survey observations. The ontology provided a way to unify observations from disparate data sources created by multiple independent projects and with different data structures. The ontology models concepts such as Observations, Actors, Sites, Ecological Processes, and Measurements. James Cook University commented that they might use this ontology as part of their tropical data hub initiative.
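The derivation relationship between a “cleaned” dataset and a raw dataset can be sketched as a handful of triples. The minimal Python sketch below uses plain tuples rather than a real triple store. All of the `ex:` identifiers, including the `ex:derivedFrom` predicate, are hypothetical names chosen for illustration; they do not come from OAI-ORE or from any of the projects above. The sketch also assumes each dataset derives from at most one parent.

```python
# Each triple is (subject, predicate, object). All ex: names are hypothetical.
TRIPLES = {
    ("ex:survey-raw", "rdf:type", "ex:Dataset"),
    ("ex:survey-cleaned", "rdf:type", "ex:Dataset"),
    ("ex:survey-summary", "rdf:type", "ex:Dataset"),
    ("ex:survey-cleaned", "ex:derivedFrom", "ex:survey-raw"),
    ("ex:survey-summary", "ex:derivedFrom", "ex:survey-cleaned"),
}


def derivation_chain(dataset, triples):
    """Follow ex:derivedFrom links from a dataset back to its original source."""
    # Assumes at most one parent per dataset (a simplification).
    parents = {s: o for s, p, o in triples if p == "ex:derivedFrom"}
    chain = [dataset]
    while chain[-1] in parents:
        parent = parents[chain[-1]]
        if parent in chain:  # guard against accidental cycles
            break
        chain.append(parent)
    return chain


print(" derived from ".join(derivation_chain("ex:survey-summary", TRIPLES)))
```

Walking the chain like this gives the kind of high level relationship view that a visualisation for end users could build on.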
The SWIG discussed how representing data, as opposed to metadata, using semantic web technologies introduces new scalability challenges (moving from thousands of RDF triples to millions of RDF triples). For example, CSIRO presented a paper at the 2010 eResearch Australasia conference about performance issues when querying and accessing the 30 million RDF triples in the Atlas of Living Australia dataset. CSIRO tested both open source and commercial triple stores, and observed execution times of over 8 hours for some queries, with variations of up to 9,000 times between some systems. In related work, Campbell Allen wrote a master's thesis that benchmarked RDF triple stores against a traditional relational database. He found that the relational solutions outperformed the triple stores, especially for spatial queries. Campbell did observe, however, that:
“The Semantic Web RDF triple stores were found to be particularly suited to the data integration task at hand due the ability of ontologies to semantically define and link the data”
The SWIG felt that these RDF triple store scalability issues may iron themselves out over time, observing that relational databases had to overcome similar scalability problems early in their development.
Representing metadata about research outputs
Many of the projects at the SWIG use terms from the ANDS-VITRO ontology to describe their research outputs. Some noted, however, that they had difficulty knowing how to apply the ontology and would benefit from documentation of best practice (or at least common practice) for representing research metadata as linked data.
The rest of this section provides an overview of common practice areas that might need documentation. The SWIG raised the topics below, but I have extended the discussion with more detailed examples to illustrate some points.
Best practices for using ANDS-VITRO and VIVO terms
The ANDS-VITRO and VIVO vocabularies can represent some research metadata concepts in multiple ways. For example, the VIVO ontology contains both vivo:webpage and foaf:homepage properties for describing links to webpages. Similarly, the ANDS-VITRO to RIF-CS metadata crosswalk document (available from the ANDS-VITRO Google group) contains many properties for describing Agent names:
- bibo:prefixName
- foaf:firstName / foaf:givenName
- foaf:lastName / foaf:familyName
- vivo:middleName
- foaf:name / rdfs:label for display
The SWIG felt that the community would benefit from guidance on using these types of “overlapping” vocabulary terms. Peter Sefton from USQ related his experience in the ARROW project where agreeing on common usage early could have avoided interoperability problems down the track.
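To illustrate why overlapping terms cause interoperability problems, here is a minimal Python sketch of what a consumer has to do when two institutions describe the same person with different name properties. The choice of which property to treat as “preferred” is purely an assumption for this example, not a recommendation from the crosswalk document, and the researcher name is made up.

```python
# Pairs of properties that the crosswalk lists as alternatives for the same
# part of a name. Treating the right-hand term as preferred is an assumption.
PREFERRED = {
    "foaf:firstName": "foaf:givenName",
    "foaf:lastName": "foaf:familyName",
}


def normalise_names(record):
    """Return a copy of record with overlapping name properties merged."""
    merged = {}
    for prop, value in record.items():
        key = PREFERRED.get(prop, prop)
        if key in merged and merged[key] != value:
            raise ValueError(f"conflicting values for {key}")
        merged[key] = value
    return merged


# Two descriptions of the same (made-up) researcher using different terms.
institution_a = {"foaf:firstName": "Alex", "foaf:lastName": "Example"}
institution_b = {"foaf:givenName": "Alex", "foaf:familyName": "Example"}

print(normalise_names(institution_a) == normalise_names(institution_b))
```

Without an agreed convention, every consumer of the data needs its own version of this mapping table.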
Supporting linked data principles
Melbourne, Griffith and QUT originally created the ANDS-VITRO ontology to represent ANDS registry objects within the VIVO system. More recently, non-VIVO uses of the ontology have also emerged, with some institutions using the ontology to create linked data representations of their research metadata. The documentation supporting the ontology, however, has inconsistent support for some of the principles championed by the linked data community. In particular, the How to Publish Linked Data on the Web tutorial suggests the following principle for choosing vocabulary terms:
“In order to make it as easy as possible for client applications to process your data, you should reuse terms from well-known vocabularies wherever possible.”
The ANDS-VITRO to RIF-CS metadata crosswalk document has mixed support for this principle. It supports the principle by recommending re-use of the Dublin Core description element (dcterms:description) for narrative descriptions of research data. On the other hand, the same document breaks the principle by recommending use of vivo:hasSubjectArea rather than the much more widely used dcterms:subject term for describing the topic of a research data collection.
These inconsistencies probably stem from the original purpose of the ANDS-VITRO ontology as a representation of ANDS registry objects within the VIVO system. The designers presumably did not consider linked data as a primary driver. Given SWIG interest in linked data representations, however, the community would benefit from a discussion about using more common linked data vocabularies and how these align with the ANDS-VITRO ontology.
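One possible way to align the two approaches is to publish the widely used term alongside the ANDS-VITRO recommendation. The Python sketch below shows the idea using plain tuples as triples. The `ex:` identifiers and the sample description are hypothetical, and publishing both terms side by side is only one option the community might consider, not an agreed practice.

```python
def add_common_subject_terms(triples):
    """Add a dcterms:subject triple for every vivo:hasSubjectArea triple."""
    extra = {
        (s, "dcterms:subject", o)
        for s, p, o in triples
        if p == "vivo:hasSubjectArea"
    }
    # Keep the original triples so VIVO-based consumers still work.
    return set(triples) | extra


# A hypothetical collection description using the crosswalk's terms.
collection = {
    ("ex:reef-surveys", "dcterms:description", "Coral reef survey observations."),
    ("ex:reef-surveys", "vivo:hasSubjectArea", "ex:marine-ecology"),
}

for triple in sorted(add_common_subject_terms(collection)):
    print(triple)
```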
Expanding the scope of ANDS-VITRO
The SWIG discussed expanding the scope of the ANDS-VITRO ontology.
The current version of the ontology only contains a subset of the concepts covered by ANDS registry objects. For example, it does not include properties for describing spatial and temporal coverage of a data collection, or alternative titles for registry objects. Some members of the SWIG reported how they use common linked data properties to include these concepts in their metadata (such as dcterms:spatial, dcterms:temporal, dcterms:alternative).
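As a rough illustration of the Dublin Core properties mentioned above, the Python sketch below writes three statements about a collection as N-Triples lines. The collection IRI and all of the values are hypothetical. For brevity the sketch uses plain string literals throughout; real metadata would more likely point dcterms:spatial and dcterms:temporal at structured location and period descriptions. The escaping only handles backslashes and double quotes.

```python
DCTERMS = "http://purl.org/dc/terms/"


def literal_triple(subject_iri, term, value):
    """Format one N-Triples line with a plain string literal as the object."""
    # Minimal escaping: backslashes first, then double quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject_iri}> <{DCTERMS}{term}> "{escaped}" .'


collection = "http://example.org/collection/reef-surveys"  # hypothetical IRI
lines = [
    literal_triple(collection, "alternative", "GBR survey data"),
    literal_triple(collection, "spatial", "Great Barrier Reef, Queensland"),
    literal_triple(collection, "temporal", "2005/2010"),
]
print("\n".join(lines))
```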
The SWIG also discussed expanding the ontology to cover concepts beyond those needed for the ANDS registry. For example, Newcastle University wish to model derivation relationships between data sets, such as the relationship between a “cleaned” dataset and a raw dataset. Many institutions also want to describe record keeping requirements relating to research data, such as information on how long to keep data, and how to dispose of it when appropriate.
Future work on the ontology might usefully compare these emerging practices, decide on a common approach, and document their use.
Where to from here?
This post summarises SWIG discussion of emerging practices for representing Australian research data and research metadata using linked data technologies. I hope it also highlights areas for future work: both in extending the scope of what we can describe and in agreeing on common ways to describe it.
Possible online forums for sharing emerging practice include this blog, the general ANDS partners mailing list and community bulletin board, and the more detailed ANDS-VITRO Google group. Some of the issues, however, such as nutting out a process for maintaining the ANDS-VITRO ontology, probably require face-to-face discussion. Simon Porter from Melbourne University has suggested the CCA-Educause conference in Sydney in April as a possible venue for another SWIG. Any takers?
Unrelated aside: this post represents my first attempt at using E-Prime to improve the clarity of my writing. I don’t think the experiment totally succeeded, but I certainly learnt a lot from the experience.
Written by Nigel Ward. Copyright The University of Queensland, 2011. Licensed under Creative Commons Attribution-Share Alike 3.0 Australia. <http://creativecommons.org/licenses/by-sa/3.0/au/>. 
The project is supported by the Australian National Data Service (ANDS). ANDS is supported by the Australian Government through the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative.