Digital repositories in e-Science - Adding Value to Data
Special Session at IEEE e-Science 2008
Indianapolis
There is a great, untapped potential for synergies between grid/e-science technologies and a cluster of related systems addressing the management of digital assets in digital libraries and repositories.
This workshop, organized by DReSNet, addressed these issues in a series of presentations available below.
For the full programme of the event see Tobias Blanke's blog post.
Keynote
How Repositories can Learn from the Failings of the Grid
David de Roure
Video: http://live.escience2008.iu.edu/bin/meeting.html#ID=73
Slides: http://live.escience2008.iu.edu/slides/slides73.ppt
In his talk, David de Roure looked at how repositories can learn from the failings of the Grid. For the Grid, the success of early adopters led to the roll-out of new services, followed by the question of where the users were. This is in many ways similar to repository development, where researchers still need to be persuaded to populate repositories. In e-Science, success stories are often linked to big e-Science projects; everyday research has been less affected by the developments. This is partly because, for too long, data was not at the centre of considerations.
David suggested looking at alternative models of middleware, away from the ideal of the single, all-encompassing middleware that the Grid was supposed to be. He cited effective, lightweight middleware layers in the Web 2.0 world.
From a data-centric perspective, e-Science should be primarily concerned with object reuse and exchange and with loosely coupled sets of linked data, with content all over the place and held together by semantic services. Repositories can play a major role in these new developments if they move away from being simply an institutional repository where content and metadata are held together in the same place.
In his final thought experiment, David suggested thinking of a world where content and metadata are held separately, e.g. by using cloud services in a kind of repository factory.
In the discussion, it was reiterated that one middleware is often impossible because the things it connects are fundamentally different. Content in the cloud, however, poses the problem of sustainability.
Accepted presentations
In his presentation on Synergies between Grid and Repository Technologies - a Methodical Mapping, Andreas Aschenbrenner took the perspective of repository managers who, wanting to administer institutional processes as effectively as possible, would like their institutional repository instances to converge. Seen this way, standardisation does not hinder innovation but enables it, provided different perspectives are combined.
Video: http://live.escience2008.iu.edu/bin/meeting.html#ID=227
Slides: http://live.escience2008.iu.edu/slides/slides227.pdf
Andrew Treloar discussed the implications of effective cost management in large-scale e-Research projects. Data is critical to research, but repositories are currently not well suited to dealing with it. There are no fully convincing ways of creating metadata automatically.
Andrew defined data as everything that is not a document and that needs metadata about context, provenance and internal semantics. Who would then bear the costs of producing this metadata? Letting researchers enter data and associated metadata themselves does not seem sustainable. There are still no convincing arguments about what exactly the benefits to researchers would be. Data custodians are expensive, and software that creates metadata automatically is not in sight.
Video: http://live.escience2008.iu.edu/bin/meeting.html#ID=228
Slides: http://live.escience2008.iu.edu/slides/slides228.ppt
Andreas Hinze introduced the Wikidora system, which combines the Fedora digital repository software with JSPwiki to build a solution for long-term accessibility of wiki content. Wiki pages are stored directly in a Fedora system, and rights management, versioning and metadata services are shared between JSPwiki and the Fedora repository.
Video: http://live.escience2008.iu.edu/bin/meeting.html#ID=230
Slides: http://live.escience2008.iu.edu/slides/slides230.ppt
Invited Expert Panel
Video: http://live.escience2008.iu.edu/bin/meeting.html#ID=229
No slides available
The final expert panel was chaired by Tobias Blanke.
Participants were Adil Hasan (SHAMAN), Reagan Moore (DICE), Roger Barga (Microsoft), Andrew Treloar (ANDS), and Matthias Razum (FIZ Karlsruhe).
In the presentations from the panel members:
Roger Barga presented several solutions from Microsoft to help with different parts of the research life cycle. He emphasized the importance of capturing semantic relationships between digital objects in tools that researchers generally use, such as MS Word. Capturing these relationships together with the workflows that produced them will lead to reproducible research.
Reagan Moore introduced iRODS, software for digital preservation and for organising distributed data into shared collections. The complexity of data management tasks can be hidden through virtualisation in iRODS, a framework to manage remote procedures on remote data. Distributed environments need recovery methods, and iRODS provides unique workflow models for these.
Andrew Treloar laid out the Australian National Data Service (ANDS) project. ANDS is funded out of an Australian cyberinfrastructure programme to develop Australian data frameworks and to build capabilities and utilities. The aim of ANDS is to create a research data commons by either copying data into a central repository or providing virtual access to it. Data commons were defined by building persistent identifiers, harmonising metadata, and exposing further services for discovery and information visualisation. One of the bigger challenges for ANDS will be how to connect existing institutional repositories, as these are usually designed for documents. The software behind them is not optimized for large data objects.
Matthias Razum from FIZ Karlsruhe offered insights into the eSciDoc project by presenting some of the challenges they faced in building a research infrastructure for a large multi-disciplinary research institution like the Max Planck Society, and the answers they found to address these challenges. eSciDoc plans to offer the ability to link arbitrary metadata to digital objects and thereby to link these objects from different stakeholder viewpoints. eSciDoc follows a well-defined layered architecture, building an infrastructure on top of which researchers and domain specialists can build solutions.
SHAMAN, as presented by Adil Hasan, is an EC-funded project aiming at long-term preservation and at making data understandable in the future. For that, they plan to use iRODS to decouple storage from data as well as the preservation processes from data. For the preservation they use the Multivalent tool. Adil emphasized that no infrastructure can be realised without buy-in from domain experts to make the data meaningful for the end user now and in the future.
The discussion first turned to whether it is important to focus on the storage of data and whether open access to the data is the right way forward. With open access, at least, it seems feasible to think of data in terms of how usable it can be for others. This interdisciplinary perspective is what e-Research and e-Science are all about. As people from different research backgrounds will be interested in exploiting the data, sufficient context information will be highly important alongside open access. The example of a historian of science interested in understanding scientific research data was cited.
Linked to this point of creating new interdisciplinary, multi-institutional research teams is the problem of creating effective authentication and authorisation virtualisation techniques. Some panel members expressed concerns about the state of the Shibboleth federations in several countries. The technology is ready, but the policies are far from ready and institutions fail to buy in. It was therefore suggested that a possible alternative would be to rely on local login mechanisms (e.g. via OpenID) but to limit the operations and services a user could run according to the trustworthiness of his or her login mechanism.
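The panel's suggestion of limiting operations according to the trustworthiness of the login mechanism can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration only: the two trust levels, the operation names and the is_allowed function are hypothetical and do not come from the panel or from any Shibboleth or OpenID library.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    """Hypothetical ranking of login mechanisms by trustworthiness."""
    LOCAL = 1       # e.g. a self-asserted local login such as OpenID
    FEDERATED = 2   # e.g. a login vouched for by an institutional federation


# Hypothetical repository operations and the minimum trust each requires.
REQUIRED_TRUST = {
    "read_public_data": TrustLevel.LOCAL,
    "annotate_record": TrustLevel.LOCAL,
    "deposit_dataset": TrustLevel.FEDERATED,
    "delete_dataset": TrustLevel.FEDERATED,
}


def is_allowed(operation, login_trust):
    """Return True if a login with the given trust level may run the operation."""
    required = REQUIRED_TRUST.get(operation)
    if required is None:
        # Unknown operations are denied by default.
        return False
    return login_trust >= required


if __name__ == "__main__":
    print(is_allowed("read_public_data", TrustLevel.LOCAL))
    print(is_allowed("deposit_dataset", TrustLevel.LOCAL))
```

The point of such a design is that a weakly trusted login is not rejected outright; it simply unlocks fewer operations than a federated one.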
Provenance, the history of records in repositories, was identified as one of the main research topics in the integration of research data management into large-scale infrastructure. The Open Provenance Model was seen as a good way forward here, as a way of representing provenance independently of a particular system. Within the model, provenance information as well as tools and services can then be exchanged between systems. Provenance is a particular issue with the massive collections expected from science, for which the processing needs to be automated. We cannot expect that these research data collections can be managed effectively by relying solely on manual curation. We need to analyse possible automation not only of the processes of data ingestion but also of assessment and quality assurance.
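As a rough illustration of system-independent provenance, the sketch below records causal links as plain (effect, relation, cause) edges and walks them to recover the lineage of a data item. The relation names used, wasGeneratedBy and wasControlledBy, are borrowed from the Open Provenance Model vocabulary, but the ProvenanceGraph class, its methods and the example file names are hypothetical; this is not an implementation of the OPM specification.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Minimal causal graph: each edge points from an effect to one of its causes."""

    def __init__(self):
        # effect -> list of (relation, cause) pairs
        self.edges = defaultdict(list)

    def add(self, effect, relation, cause):
        """Record that `effect` depends on `cause` through `relation`."""
        self.edges[effect].append((relation, cause))

    def lineage(self, node):
        """Return the set of all nodes that `node` directly or indirectly depends on."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for _relation, cause in self.edges.get(current, []):
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


# Hypothetical example: a calibrated file produced by a process run by an agent.
graph = ProvenanceGraph()
graph.add("calibrated.csv", "wasGeneratedBy", "calibration-run")
graph.add("calibration-run", "used", "raw.csv")
graph.add("calibration-run", "wasControlledBy", "alice")

if __name__ == "__main__":
    print(sorted(graph.lineage("calibrated.csv")))
```

Because the edges are plain tuples rather than objects tied to one repository system, they could in principle be serialised and exchanged between systems, which is the property the panel valued.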
Private, shared and public research spaces seem to be merging more and more within the research workflow. Roger Barga drew a picture of research that always happens in a sandbox, where all the context and provenance information linked to the research items is recorded automatically, and where publication into the shared spaces of research groups as well as the public spaces of formal publications can happen without too much interruption of research processes. Modern repositories supported by semantic technologies can help follow the chain of actions and data, and the emerging web of knowledge, as data is propagated through the various parts of the research process. Repositories are then just the plumbing that works in the background.
Further topics included:
- The creation of journals that allow data to be published alongside texts, leading to reproducible research
- Semantic databases to follow the chains of data
- Federation of independent repositories and sharing policies
- Data repositories should have a completely different feature set than the traditional institutional repositories
- How data management can be embedded in research workflows