Introduction
Responding to researchers' requests for help locating data repositories is an important element of research data services. To respond effectively to scenarios like, “I want to (or must) share my data, but it can’t be fully anonymized and remain valuable for research,” it is necessary to determine what repository options exist for sensitive data. A call from the Data Services Continuing Professional Education (DSCPE) program for capstone projects provided the opportunity to create a list of data sharing repositories that allow restricted data to be shared via data use agreements (DUAs) mediated by the repository itself. The DUA review process typically ensures that users have a genuine research need, are appropriately trained in privacy and security, and have access to the appropriate infrastructure to safely manage the data.
Project Criteria
Without a DUA process implemented by a data repository, data are typically shared either with no mediation or with basic researcher mediation, in which access requests are handled through a fully automated process within a repository or by email. Both approaches put a lot of responsibility on the corresponding author or data depositor and are largely inappropriate for data containing sensitive variables. DUAs are not a perfect solution, nor are they the only way to achieve properly mediated access to sensitive data: they add friction to data access and can create significant workloads that do not always align with staffing. DUAs are, however, a standard mechanism for facilitating data sharing that respects terms and conditions. They allow the review of requests to be taken on by repositories designed for this kind of data access, instead of leaving the onus on individual researchers and their institutions.
The first step in generating a spreadsheet of data repositories that accept sensitive data and offer mediated DUAs was brainstorming known repositories like ICPSR, QDR, Figshare, and Vivli, followed by identifying lists of data repositories compiled by various agencies, institutions, and publishers. The following data repository lists were reviewed:
OSF Approved Protected Access Repositories
NIH Generalist Repositories
Simmons Data Repositories
Linguistic Data Consortium
Springer Nature Recommended Repositories
CESSDA
Repositories listed on these sites that were marked as accepting sensitive data were added to a spreadsheet for further investigation. Based on the stakeholder needs of the host institution, repositories needed to be based in the United States and have a social science or multidisciplinary focus.
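The screening step described above can be pictured as a simple filter over candidate records. The following is only a minimal sketch under stated assumptions, not the workflow actually used in the project: the field names, the country code convention, and the example records are all hypothetical placeholders and do not describe any real repository.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One row of a hypothetical screening spreadsheet."""
    name: str
    country: str                  # assumed convention: two-letter country code
    focus: str                    # assumed free-text disciplinary focus
    accepts_sensitive_data: bool


# Disciplinary focuses that were in scope for the host institution.
ACCEPTED_FOCUS = {"social science", "multidisciplinary"}


def meets_criteria(candidate: Candidate) -> bool:
    """Return True if a candidate passes all three screening criteria."""
    return (
        candidate.accepts_sensitive_data
        and candidate.country == "US"
        and candidate.focus.lower() in ACCEPTED_FOCUS
    )


# Placeholder records for illustration only; not real repositories.
candidates = [
    Candidate("Example Repository A", "US", "social science", True),
    Candidate("Example Repository B", "DE", "social science", True),
    Candidate("Example Repository C", "US", "genomics", True),
    Candidate("Example Repository D", "US", "multidisciplinary", False),
    Candidate("Example Repository E", "US", "Multidisciplinary", True),
]

shortlist = [c.name for c in candidates if meets_criteria(c)]
print(shortlist)
```

In this sketch a record is kept only when all three conditions hold, so a repository outside the United States, one with an out-of-scope focus, or one that does not accept sensitive data is dropped before any further investigation.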
Creating a Repository List
The initial review of the repositories focused on confirming the repository name, website address (or URL), and whether it accepted sensitive data. Review then moved to determining each repository’s disciplinary focus, noting whether it provided a mediated DUA, and creating a repository description. Some resources listed in these aggregators were defunct, did not accept data deposits, or were datasets rather than repositories. For others, it was not initially possible to determine the type of data protection process used by the repository. This required additional investigation into the repository documentation, and in some cases, reaching out directly to the repository administrators proved necessary.
Ultimately, 16 US-based data repositories that accept sensitive data within the social sciences and offer mediated access via DUAs were identified and organized into a spreadsheet. To improve the usability of the spreadsheet for data services librarians assisting researchers, additional details were needed. This led to including information about whether there were costs to deposit and/or access data, the file formats accepted, the curation requirements, and the process to deposit data. This resource is available on OSF at https://osf.io/k9u5x for others to expand on and use in the course of their own work.
Documenting the Process
In addition to creating the spreadsheet of data repositories, each step of the process was documented. This documentation will help anyone who wishes to expand the resource to understand the decisions taken, the current organizational structure, and general terminology. Establishing general terminology addresses a fundamental need, because data repositories lack a shared vocabulary. This gap proved particularly challenging during the project when parsing out important details about DUAs and restricted data. To overcome this issue, a glossary of terms was included in the documentation to create a shared vocabulary for anyone using the resource.
Project Limitations
While re3data, DataCite, and FAIRsharing are all popular and extensive lists, the number of repositories they list exceeded what could be reviewed within the project's time frame. The Restricted Access Repositories spreadsheet also provides additional information only for repositories matching the project criteria, which limits its usefulness for researchers outside the social sciences. Beyond reviewing the repository lists excluded from this project, there is significant value to be added to this resource by expanding the details related to science, humanities, and medical research data sharing, which were generally out of scope for the initial iteration of the project.
Conclusion
In our repository review, ICPSR, a generalist repository with mediated data deposit, stood out as an excellent repository to consider for social science researchers looking to archive sensitive data. The other data repositories outlined in our resource are discipline specific to varying degrees and would be good choices when a researcher’s dataset aligns with their focus.
Although many researchers never need to deposit sensitive data, for those who do, resources that support this need are pivotal for their success. Funders that require data sharing are no longer as quick to accept the impossibility of sharing sensitive data and instead are looking for researchers to make decisions that will support sharing in some way, whether by developing fully anonymous datasets or by depositing sensitive data into a repository with appropriate policies, processes, and security infrastructure to support proper management of the data.