Big or small data?
Research data have always been at the core of scientific research, yet the primary conduit of scientific communication has been the peer-reviewed journal article. The article summarizes, synthesizes and interprets the raw data; places the data in the context of theory, hypotheses and mechanisms; and takes a point of view on the data. However, it does not provide sufficient detail about the data to support integration within larger data contexts, or alternative synthesis and interpretation.
The era of big data launched with advances in computing power and analytic software, accelerated in part by a seminal conference and book on the future of scientific communication.1 Big data is a fast-growing trend, generating strong demand for open data programs2 and influential studies highlighting the problems with current informal data practices.3,4 In response, the research community has begun to make open publishing of research data a core part of scientific research and communication. Many have argued that the value of the journal article will decrease4 as the value of research data archives increases over the next few years, and recommendations for what needs to be done abound.3,5,6,7,8
The research community is increasingly making raw and summarized research data available for preservation and use by other researchers, both by linking data to publications and by depositing it directly in open repositories. At present, researchers in most scientific disciplines – genomics, astronomy and physics are exceptions – make little research data available to other scientists, for reasons including lack of credit, lack of distribution control, and fear that others will spot key insights they overlooked. Another important factor is the effort and informatics expertise required to standardize and normalize the data, and to add the provenance and descriptive metadata that domain-specific data repositories require.
More researchers participate when repositories require little metadata or informatics support, but such repositories do not integrate data sets on the same topic, are not easily discoverable, and provide little opportunity for analytics. Nevertheless, funding bodies and government organizations are calling for massive increases in the sharing and availability of research data.
One key difficulty in building sustainable services around research data is that scientific research is built largely on tiny niches of research — a “long tail effect” — with many thousands of small data sets. The problem is better thought of as small data, but the need is just as real.
Research data flow and services
Figure 1 illustrates the general flow of research data; the model has been assembled from various reports by the research and education community. It also highlights work that could be done to disclose and increase the value of research data.
The first step (archive) is required, and a number of self-serve options are already available. All others are optional and discipline-dependent.
These services could include:
- Deposit in a preservation archive and/or register in an appropriate database with citable DOIs (with the researcher controlling distribution and timing). Where possible, the data should be deposited in discipline-specific data repositories to maximize value to the research community in terms of discovery, analytics, comparisons and pattern detection.
- Normalize data (align to taxonomies, link to existing entities, scale values to reference models where appropriate); anonymize where needed
- Perform standard checks for accuracy and typographical errors (e.g., in materials properties data, where such errors are known to occur)
- Annotate with standard descriptive metadata. This critical step makes the data discoverable, and can require effort in the data-capture phase
- Annotate with provenance metadata and descriptions (very difficult in some disciplines)
- Obtain peer review where appropriate and requested
- Track and report downloads, usage and citations to funding agencies
- Link between data sets and papers
- Enhance discoverability (mostly nomenclature, but also registrations or search services as needed) such that visualizations, analytics, special reports, and transformations across data sets are possible
- When appropriate, create solutions for specific tasks, such as diagnostics or design
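As a concrete illustration of the archiving and annotation steps above, the sketch below shows what a minimal data-set record might look like once the citable identifier, descriptive metadata and provenance fields those services populate are in place. The field names loosely follow the DataCite metadata kernel, and every value is a hypothetical placeholder rather than a real data set.

```python
# A minimal sketch of the descriptive and provenance metadata the services
# above would attach to a deposited data set. Field names loosely follow
# the DataCite metadata kernel; all values are hypothetical placeholders.

REQUIRED_FIELDS = {"identifier", "creators", "title", "publisher", "publication_year"}

def validate_record(record):
    """Return the set of required descriptive fields missing from a record."""
    return REQUIRED_FIELDS - set(record)

example_record = {
    "identifier": "doi:10.0000/example.dataset.1",  # citable DOI (placeholder)
    "creators": ["Doe, J."],
    "title": "Example materials-properties data set",
    "publisher": "Example Institutional Repository",
    "publication_year": 2013,
    # Discoverability: subject terms aligned to a domain taxonomy
    "subjects": ["materials science", "thermal conductivity"],
    # Provenance: how the data were produced and normalized
    "provenance": "Measured with instrument X; values scaled to reference model Y.",
}

missing = validate_record(example_record)
print("missing required fields:", sorted(missing))
```

A repository ingest step could run a check like `validate_record` before accepting a deposit, which is one way the "little metadata" repositories mentioned earlier differ from discipline-specific ones.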
Elsevier has much to contribute in all of these areas and has a unique depth of insight provided by its activities in publishing (journals, books) and institutional performance reporting (SciVal® suite of products).
Refining key services through pilots
We started this year with several pilots, working with research institutions and data repository owners (e.g., Columbia University, Duke University, Carnegie Mellon University, and University College London) to increase the flow of shared, open research data. Together we are working out the processes and resources needed to share more research data in the target disciplines as efficiently as possible. These initiatives will be undertaken only in collaboration with the academic and research communities – never unilaterally. After the pilots end, we will examine funding models to make the process sustainable and explore the need for other services, e.g., establishing new databases for data that have no good home today (such as neuroimaging data), or providing analytics based on domain informatics expertise.
The two most important principles for these services are:
- Data must be open and shared, with distribution controlled by the creator of the data (when possible).
- The model must be derived in collaboration with the research community and funding agencies, not driven by Elsevier or any publisher.
Publishing research data is a labor-intensive job, requiring special data science, process management, annotation and informatics skills. Institutions and labs will determine the best allocation of resources to meet these needs among their own researchers, librarians and other internal staff, and external services. Organizations at the cusp of establishing a research data management program (increasingly demanded by funding bodies) may want advice on jump-starting and formalizing the program. Those further down the road may be interested in introducing additional expertise, efficiency and rigor.
If you would like to be notified when Elsevier has further news about these research data initiatives, e-mail:
1. Hey, T., Tansley, S. & Tolle, K. (Eds.). (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.
2. Boulton, G. (2012). Open your minds and share your results. Nature 486, 441. doi:10.1038/486441a.
3. Hodson, S. (2012). JISC and Big Data: Simon Hodson at Eduserv Symposium 2012.
4. The Ode Project. (2012). Ten Tales of Drivers & Barriers in Data Sharing. The Hague: Alliance for Permanent Access.
5. The Royal Society (2012). Science as an Open Enterprise. The Royal Society Science Policy Centre Report 02/12. London: The Royal Society.
6. OECD (2007). Principles and Guidelines for Access to Research Data from Public Funding. Paris: OECD Publications.
7. National Academy of Sciences (2009). Ensuring the Integrity, Accessibility and Stewardship of Research Data in the Digital Age. Washington: National Academy of Sciences.
8. European Commission (2010). Riding the wave: How Europe can gain from the rising tide of scientific data. Final report of the High Level Expert Group on Scientific Data.