The Purdue University Research Repository (PURR) is not only a research data repository for Purdue’s researchers, but also a suite of services to help researchers manage their data throughout the research lifecycle. PURR’s objectives include responding to funders’ mandates, enabling broader access to scholarship, and increasing the impact of Purdue’s research.
Why data sharing is important
In 2013, the United States Office of Science and Technology Policy began requiring the 22 largest federal funding agencies to address data management plans (DMPs) and the issue of data sharing within grant proposals. Agencies in many other countries, including the United Kingdom and Australia, have similar policies.
Beyond requirements, data sharing is good science. By sharing their data, Purdue’s researchers enable others to reproduce and validate their research findings, which gives the original researcher transparency, accountability, and independent support for those findings. Data reuse also saves colleagues time and money.
Lastly, and perhaps most compelling for some, researchers can get “credit” when others use and cite their data. Researchers can count these citations toward research impact and include them in their CV or promotion and tenure portfolio. As this practice becomes more common, researchers’ motivation to share data will grow, as will the weight that external evaluators give to data reuse.
PURR, the libraries and the research office
Dimensions of Discovery (Winter 2013). Office of the Vice President for Research, Purdue University.
The National Science Foundation is the biggest funder of research on Purdue’s campus, so when the NSF began requiring DMPs with grant proposals in 2011, the research office took notice. The Purdue University Libraries (the libraries) had already been prototyping data repositories and piloting various data services, so the Vice President for Research knew there was a base of knowledge to build on. From this footing sprang a collaboration among the libraries, campus IT, and the research office to develop the PURR repository (based on HUBzero®, an open source software package) and services.
From the libraries’ perspective, playing a lead role in PURR dovetailed with the three pillars of their strategic plan: learning, meeting global challenges, and scholarly communications. Policies and established practices were in place for:
- Data collection development
- Digital preservation of data
- Data within information literacy, outreach and instruction
- Data in reference and public services
- Technical services such as metadata consultation
Librarians also collaborate as co-principal investigators on large data-producing projects. Librarians are interested in stewarding the scholarly record, and of course, data is part of that scholarly record. The university’s sponsored programs services support these efforts. If they see a proposal where the researcher plans to use PURR in the DMP, they notify the libraries so a librarian can follow up and offer assistance.
PURR offers four main categories of service and will continue to refine and develop new services as the data management landscape evolves.
- Create and implement DMPs: Within PURR, researchers can find boilerplate text for DMPs, real-world examples, a self-assessment, and links to other resources such as the DMPTool. PURR staff also conduct workshops and offer consultations with subject-specialist librarians, who help researchers write and implement DMPs.
- Collaborate: When researchers create a project in PURR, they can invite collaborators to join a private project space where they can share and develop data and code with functionality such as version control, wiki updates, and other project management features. These spaces, sometimes called virtual research environments (VREs), have a default 10 GB of free storage space for three years with added storage incentives for public sharing and sponsored projects.
- Publish: The researcher selects the files that make up the dataset and provides the associated metadata (such as title, author, abstract, and subjects); selects a license; decides whether to embargo the data; and submits the dataset for publication. A digital object identifier (DOI) is assigned to the dataset, which facilitates citing and tracking. Researchers receive monthly emails with usage statistics on their published datasets, and PURR tries to track citations so that researchers can gauge the impact of sharing their data. Research data includes more than tables, spreadsheets, and databases. Images, video, audio, observation logs, scientific workflows, software source code, interview transcripts, and survey instruments and results can all be published.
- Archive: PURR uses existing standards such as:
  - BagIt file packaging format
  - Metadata Encoding and Transmission Standard (METS)
  - Metadata Object Description Schema (MODS)
  - Dublin Core Metadata Element Set
  - Preservation Metadata: Implementation Strategies (PREMIS)
When a dataset is approved for publication, PURR establishes fixity for the data (checksums that later reveal any change to a file), serializes the descriptive and PREMIS preservation metadata to XML, and packages everything using BagIt, a Library of Congress software tool. The BagIt bag is essentially the archival information package, or AIP, which is replicated to seven sites using LOCKSS (Lots Of Copies Keeps Stuff Safe), a digital preservation system. Purdue is a member of the MetaArchive Cooperative, which manages the LOCKSS network’s infrastructure and governance. PURR commits to maintaining the data for a minimum of 10 years. At the end of 10 years, the data is remanded to the libraries and is subject to data or collection management policies and practices, which include appraisal and selection/deselection.
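The fixity and packaging steps described above can be illustrated with a minimal sketch. This is not PURR's implementation: the function names (`sha256_of`, `make_bag`, `verify_bag`) are hypothetical, SHA-256 is an assumed choice of fixity algorithm, and the metadata serialization and LOCKSS replication steps are left out. The sketch uses only the Python standard library to lay out a bag in the shape the BagIt format defines: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(files: list[Path], bag_dir: Path) -> Path:
    """Copy files into bag_dir/data and write the declaration and manifest."""
    payload_dir = bag_dir / "data"
    payload_dir.mkdir(parents=True)
    manifest_lines = []
    for source in files:
        target = payload_dir / source.name
        shutil.copy2(source, target)
        # One manifest line per payload file: checksum, then its path in the bag.
        manifest_lines.append(f"{sha256_of(target)}  data/{source.name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. in an empty manifest
        expected, relative_path = line.split(maxsplit=1)
        if sha256_of(bag_dir / relative_path) != expected:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical dataset file, created in a temporary directory for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "observations.csv"
        source.write_bytes(b"site,value\nA,1\n")
        bag = make_bag([source], Path(tmp) / "dataset-bag")
        print("bag verifies:", verify_bag(bag))
```

Because the manifest travels with the payload, any site holding a replica can rerun a check like `verify_bag` to detect corruption without consulting the original repository.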
Organization and staffing
The PURR Executive Committee includes the Dean of Libraries, the Vice President for Research and the CIO. It meets once a semester to discuss high-level resource and policy issues. Once a month, the PURR Steering Committee meets to set policies, priorities and direction for PURR. This group includes two representatives each from the libraries, campus IT, and the research office, and three faculty members who are doing data-driven research (to represent user interests).
PURR staff include:
- Project director (0.50): develops roadmap and manages the team
- Technologists (3.85): develop functionality beyond the HUBzero platform
- HUBzero liaison (0.35)
- Metadata specialist (0.20): advises on standards and best practices for metadata, and consults on questions and issues that come up in the organization and description of datasets
- Digital archivist (0.25): designs and implements the archival system, incorporating standards and best practices
- Digital data repository specialist (1.00): leads engagement, outreach and support for PURR on campus
- Data curator (1.00): identifies, stages, and ingests new data collections
Subject liaison librarians are unofficial members of the team. They are notified every time a researcher creates a project in their subject areas. If a grant award is registered and associated with a project, the librarian is notified and receives the DMP (if available). This provides subject liaison librarians with the background to approach faculty and offer support.
PURR tracks several metrics to understand the system’s usage and success. In its first two years, PURR has been written into more than 1,400 DMPs and has tracked 163 grant awards and about 100 citations to data in PURR. PURR is approaching 500 research projects and 1,300 registered researchers.
The services and platform continue to evolve rapidly, and the approach has been iterative: learn and improve the services as they grow. At the same time, Purdue has been experimenting with new roles for librarians and libraries relating to research data and the practice of librarianship. Having a data repository has helped create opportunities for librarians to engage researchers in these issues and to develop new library practices to support data-driven research and learning.