This is the complete article originally shared as a three-part series.
Research data has always been at the core of much scientific research, though the primary conduit of scientific communication has been the peer-reviewed journal article. The article summarizes, synthesizes and interprets the raw data, and places the data in the context of theory, hypotheses and mechanisms. However, in its current form, the article alone does not provide sufficient details of the data to facilitate integration within larger data contexts, or to allow for reconstruction of the experiment or alternative analyses, syntheses or interpretations.
The era of Big Data launched with advances in computing power and analytic software (1) and propelled a fast-growing trend, resulting in great demand for open data programs (2) and influential studies highlighting the problems and challenges with the current informal data practices (3,4). Many have argued that the value of the journal article will decrease (4) as the value of available research data increases over the next few years, and recommendations abound for what needs to be done (3,5,6,7,8).
As a result, there is increased pressure in the research community to make research data (raw and summarized) available, both linked to publications and deposited directly into open repositories, for preservation and use by other researchers. At present, in most scientific disciplines (genomics, astronomy and physics are notable exceptions), little research data is made available to other scientists. Reasons for the low participation by researchers include a fear of being scooped and a sense of a lack of rewards for storing and sharing data. In the big picture, many researchers feel they do not currently have proper incentives for sharing their research data compared with the long-term, career-related incentives (i.e., tenure) of having articles published. In the day-to-day picture, many fear that sharing data will be a time sink and are not clear on what funding body mandates require. Presently many systems and tools are in place to store research data in domain-specific, institutional, local and global repositories. However, no coordinated set of practices or even instructions exists to enable the majority of researchers to incorporate effective modes of research data management into their workflow.
Funding agencies are increasingly concerned with improving the reproducibility of research and allowing the public to hold scientists accountable for the results of their experiments. They are implementing policy statements to improve data storage, curation and sharing (9,10). It is not clear yet where the burden for compliance will ultimately land. At many research institutions, libraries, IT departments, and offices of research are increasingly preparing to meet that obligation, on behalf of and in collaboration with the institution’s scientists, engineers and scholars.
The goal of this article is to sketch a view of the various aspects involved in managing, describing, preserving and making research data available and accessible to appropriate audiences, and to propose a series of projects to tackle issues preventing their effective implementation. After sharing our views of the current state of research data management inside institutions, we propose pilot projects with a number of institutions to explore how to provide research data management designed for the needs of each specific institution. Each engagement will be unique, but together these projects can paint a landscape of needs and solutions. Each party can reference this landscape in determining how best to contribute value to the most effective and efficient solution, and how to jointly move forward.
A. Research data management in institutions: Stakeholders and information flows
Figure 1: Overview of the parties involved in RDM within an institution and the research data information flow among parties
Source: Image created by Victor Henning and Anita de Waard, © 2013 Elsevier.
Technical and policy changes herald a brave new world of linked research data, but different participants in the research data management workflow feel the pressure to bring about this change. Figure 1 sketches the various stakeholders within the institution and the flows of information involved. In particular:
- Researchers have to conform to reproducibility requirements for their data, and need safe, efficient and policy-compliant tools and processes for storing and annotating their research data.
- Data Repositories are asked to deliver more cost-effective ways to dramatically increase the volumes of data they curate and store. Though usually separately funded, these repositories are technically located inside an institution, and share physical and technical infrastructures with the campus.
- Libraries run the risk of being disintermediated in an open access world, and are looking for ways to use their skills and systems to connect research data to the repositories and knowledge management systems they curate.
- Offices of Research Administration are anticipating the need to track the full set of digital artifacts created inside the institution to ensure compliance with contractual data sharing policies.
Several types of information flows connect these parties:
- The data flow: As data is created by researchers it gets deposited and curated in one (or more) of a multitude of possible repositories: the institutional repository (IR), external (whether domain-specific, e.g., Protein Data Bank, PetDB, or domain-agnostic, e.g., DataDryad, Figshare) research databases, or cloud-based storage facilities such as Dropbox.
- The indexing flow: To allow cross-repository search, these data must be indexed.
- Usage reporting: For compliance and merit assessment purposes Research Offices are interested in usage and viewing data for the deposited research data.
Although clearly interconnected, these different stakeholders and these complex interdependent information flows are not centrally managed or manageable. Funding and technical requirements are independently driven, and there are no platforms for these groups to connect, either organizationally or technically. Before we can arrive at an optimal flow of information for research data within and between institutions, the following bottlenecks need to be addressed:
1. Researchers: Ensure that research data is stored and curated during creation
Current assessments agree that, depending on the domain, between 70 and 90 percent of the research data created is currently not stored outside of a researcher’s own lab (11). To overcome this bottleneck, research data needs to be captured together with the environmental, process and protocol details that are critical for understanding provenance or reproducibility, using standardized (taxonomic or controlled-vocabulary) names where possible. Much of the documentation currently happens after collection, but tools that capture metadata at the time of data collection offer a quicker and more accurate way to record experimental circumstances.
When raw data are captured or created in research, they often have to be transformed or normalized before they are useful for comparison with other similar experiments. Any transformations or adjustments in data from raw form, as well as any workflows for handling or manipulating the data, need to be documented in detail using standard nomenclatures as much as possible. The data need to be labeled with standard nomenclatures from subject-specific vocabularies to allow greatest discoverability, comparison and use by other scientists. Any standardized entities (reagents, animal species and strains, antibodies) used or produced in the research need to be unambiguously defined as part of the research metadata.
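As an illustration of what documenting transformations alongside the data could look like, here is a minimal Python sketch. It is not a description of any existing tool: the `Dataset` and `TransformationStep` classes, the field names and the sample values are all hypothetical, chosen only to show raw values, descriptive metadata and a step-by-step history travelling together.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TransformationStep:
    """One documented change applied to the data (hypothetical structure)."""
    name: str          # short label, ideally taken from a controlled vocabulary
    description: str   # what was done and why
    applied_at: str    # ISO 8601 timestamp recorded when the step runs


@dataclass
class Dataset:
    """Values plus the metadata needed to trace them back to raw form."""
    values: list[float]
    metadata: dict[str, str]  # e.g. species, unit, instrument
    history: list[TransformationStep] = field(default_factory=list)

    def apply(self, name, description, func):
        """Return a new Dataset with func applied to every value and the step logged."""
        step = TransformationStep(
            name, description, datetime.now(timezone.utc).isoformat()
        )
        return Dataset(
            values=[func(v) for v in self.values],
            metadata=dict(self.metadata),
            history=self.history + [step],
        )


# Illustrative raw measurements; the metadata values are made up.
raw = Dataset(values=[2.0, 4.0, 6.0],
              metadata={"species": "Mus musculus", "unit": "mV"})
peak = max(raw.values)
normalized = raw.apply(
    "normalize", "divide each value by the series maximum", lambda v: v / peak
)
```

Because `apply` returns a new object rather than changing the original, the raw data stays untouched and every derived dataset carries the full list of steps that produced it.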
2. Data Repository: Increasing amount and decreasing cost of manual curation
Over the past 10 years many domain-specific databases have been set up that, mostly through the manual curation of papers, provide a processed summary of research results. The contents of these databases are usually of a very high quality and they are invaluable assets in a given field (12). However, these databases risk losing funding and are looking for sustainable models to maintain their services. Developed in relative isolation, many of them are built on dedicated technology platforms not optimized for integration with external search and indexing services. Also, the curation process is usually manual and might benefit greatly from text mining and other semi-automated approaches.
3. Institutional Repository/Library: Keeping track of all data created by researchers
Because of researchers’ reluctance to share their data and the multitude of places to store it, most IRs are only storing a small portion of the data produced at the institution. To improve the awareness of data produced at the institution and maintain an index of it, interoperable metadata layers are needed that track where and how data was stored and shared, and therefore allow at least institutional cataloguing of data outputs (akin to a publication index). To make this possible the role of the IR in research data management must be clearly defined and the IR must track where data was created and is stored. This requires an increased level of interoperability with other parts of the institution, in both organizational as well as technical systems.
4. Research Office: Enable usage reporting on all data created by researchers
Apart from compliance with funding requirements, obtaining credit for shared data is probably the biggest driver to motivate researchers to increase the amount of research data shared. The use and impact of shared data are tracked by a few services, but no good definitions or interfaces as yet combine statistics on all the repositories where data originating from a specific institute ended up. And there are no agreed metrics on what constitutes a good impact assessment for a dataset. Standards and systems are needed to enable such cross-repository tracking of deposits, usage and downloads; these would provide scientists, the institution and the funding agencies a wide set of mechanisms for usage and impact reporting.
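To make the idea of cross-repository tracking concrete, the following Python sketch combines usage counts for the same dataset from several repositories. It rests on a strong assumption that does not generally hold today: that every repository reports usage against a shared, persistent dataset identifier. The repository names, identifiers, field names and numbers are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical usage reports, one list of records per repository.
reports = {
    "institutional_repository": [
        {"dataset_id": "dataset-A", "views": 120, "downloads": 30},
        {"dataset_id": "dataset-B", "views": 40, "downloads": 5},
    ],
    "external_repository": [
        {"dataset_id": "dataset-A", "views": 80, "downloads": 20},
    ],
}


def aggregate_usage(reports):
    """Sum views and downloads per dataset across all repositories."""
    totals = defaultdict(lambda: {"views": 0, "downloads": 0, "repositories": []})
    for repository, records in reports.items():
        for record in records:
            entry = totals[record["dataset_id"]]
            entry["views"] += record["views"]
            entry["downloads"] += record["downloads"]
            entry["repositories"].append(repository)
    return dict(totals)


usage = aggregate_usage(reports)
```

Simple sums like these are only a starting point; as noted above, there are no agreed metrics yet for what constitutes a good impact assessment for a dataset.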
C. Potential Pilot Projects
To help academic institutions to prepare for the future of data-driven science, Elsevier’s Research Data Services team is interested in establishing a series of pilot projects with a number of universities. We are especially interested in pilots where different departments from the same university are involved, including research units, the IR, the library and the Office of Research. We are proposing the following pilot projects that will connect various stakeholders within the institution:
1. Researcher/Repositories: Data management tools
To help researchers capture data in a more effective and efficient format, tools are needed that map to the researchers’ workflow and allow them to store, retrieve and process their data, including an easy means to reference protocols as well as enable direct capture of data. The ideal Electronic Lab Notebook allows researchers full freedom to model and capture their workflow and manipulate the data, while keeping all data under their control. In a perfect tool, an export function would allow the data to be shared or uploaded into either an external repository or a publishing system, and the heritage of the data will let (re)viewers of the data examine the steps undertaken to prepare it.
We are interested in exploring systems that would allow researchers to decide where to store and whether to share their data on a granular level: in an external data repository if appropriate (and, if so, in a format that minimizes tedious curation tasks on behalf of the repository), in the institute’s IR or in an external cloud-based solution. The metadata created during the process of data creation and curation should be similar for each of these options, to minimize duplication of effort and optimize cross-repository search.
2. Institutional Repository/Office of Research Administration: Integrated compliance reporting
We would be interested in exploring standards and systems to report on the deposition, viewing, downloads, citation and use of research data created by scholars in the institution. Optimally we would explore a wide range of data to report on, as stored in external repositories and cloud-based solutions or in the institution’s IR. We would want to explore the types of reporting and compliance measures required or anticipated by a series of funding agencies and experiment with reporting standards and tools.
3. Researcher/Library: Integrated data search
Because of the variety of metadata standards, discovery of what is in these repositories is tedious, especially for those who have to check several repositories for several records. We are advocating the development of a unified metadata layer for repositories to make querying much more efficient for the general types of requests outlined in the prior use cases. We see the librarian as a key specialist to help develop the standards and practices of curation of the data with such metadata.
Working with existing standards and curation efforts, such a “Unified Metadata Layer” could be incorporated into the researcher’s workflow and connect to the IR and to domain-specific or domain-agnostic research databases. A shared layer of metadata components would not only make it easier for researchers to record and annotate their experiments and the data generated; it could also draw on standard vocabularies to automatically create the metadata needed to accompany experimental results.
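One way to picture such a layer is as a set of mappings from each repository's own field names to a small shared schema, so that a single query can run over records from every source. The Python sketch below is purely illustrative: the repository names (`repo_a`, `repo_b`), the field names and the sample records are assumptions, not an existing standard or API.

```python
# Hypothetical mappings from repository-specific field names to shared ones.
FIELD_MAPS = {
    "repo_a": {"title": "title", "creator": "author", "organism": "species"},
    "repo_b": {"dc:title": "title", "dc:creator": "author", "taxon": "species"},
}


def to_unified(repository, record):
    """Translate one repository record into the shared schema."""
    mapping = FIELD_MAPS[repository]
    unified = {"source": repository}
    for local_name, value in record.items():
        if local_name in mapping:  # fields without a mapping are left out
            unified[mapping[local_name]] = value
    return unified


def search(records, **criteria):
    """Return the unified records that match every given field exactly."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in criteria.items())
    ]


# Made-up records, as two different repositories might describe them.
index = [
    to_unified("repo_a", {"title": "Heart rate series",
                          "creator": "A. Researcher",
                          "organism": "Mus musculus"}),
    to_unified("repo_b", {"dc:title": "Liver imaging set",
                          "dc:creator": "B. Researcher",
                          "taxon": "Mus musculus",
                          "internal_id": "42"}),
]
hits = search(index, species="Mus musculus")
```

Here one query on `species` finds records from both repositories even though each names that field differently. The hard part in practice is agreeing on the mappings and vocabularies, which is where we see the librarian's expertise coming in.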
Please feel free to contact our team (see contributors below) if you have any further questions or comments about this effort, or are interested in arranging a meeting to discuss research data at your institution in more detail.
1. Hey, T., Tansley, S. & Tolle, K. (Eds.). (2010). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.
2. Boulton, G. (2012). Open your minds and share your results. Nature 486, 441. doi:10.1038/486441a. http://www.nature.com/news/open-your-minds-and-share-your-results-1.10895
3. Hodson, S. (2012). JISC and Big Data: Simon Hodson at Eduserv Symposium 2012. http://www.youtube.com/watch?v=dTJft2wkHdA
4. The Ode Project. (2012). Ten Tales of Drivers & Barriers in Data Sharing. The Hague: Alliance for Permanent Access. http://www.alliancepermanentaccess.org/wp-content/uploads/downloads/2011/10/7836_ODE_brochure_final.pdf
5. The Royal Society (2012). Science as an Open Enterprise. The Royal Society Science Policy Centre Report 02/12. London: The Royal Society.
6. OECD (2007). Principles and Guidelines for Access to Research Data from Public Funding. Paris: OECD Publications.
7. National Academy of Sciences (2009). Ensuring the Integrity, Accessibility and Stewardship of Research Data in the Digital Age. Washington: National Academy of Sciences.
8. European Commission (2010). Riding the wave: How Europe can gain from the rising tide of scientific data. Final report of the High Level Expert Group on Scientific Data. http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
9. Research Councils UK (2013). RCUK Policy on Open Access and Supporting Guidance. http://www.rcuk.ac.uk/documents/documents/RCUKOpenAccessPolicy.pdf
10. Office of Science and Technology Policy (2013). Increasing Access to the Results of Federally Funded Scientific Research. http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
11. DANS. The Dutch data landscape in 32 interviews and a survey (see, e.g., fig. 3.1). http://www.dans.knaw.nl/en/content/categorieen/publicaties/dutch-data-landscape-32-interviews-and-survey
12. RIN (2011). Data centres: their use, value and impact. RIN Study, September 2011. http://www.rin.ac.uk/data-centres