Journal of eScience Librarianship

Institutional data repositories are the acknowledged gold standard for data curation platforms in academic libraries. But not every institution can sustain a repository, and not every dataset can be archived due to legal, ethical, or authorial constraints. Data catalogs (metadata-only indices of research data that provide detailed access instructions and conditions for use) are one potential solution, and may be especially suitable for "challenging" datasets. This article presents the strengths of data catalogs for increasing the discoverability and accessibility of research data. The authors argue that data catalogs are a viable alternative or complement to data repositories.


Introduction and situation of the problem
Over the past several years, the number of academic libraries offering research data curation services has grown (Hudson-Vitale et al. 2017). Often this service coincides with the library hosting a standalone data repository (e.g., the Illinois Data Bank at the University of Illinois at Urbana-Champaign (University of Illinois 2021)) or an institutional repository that accepts data (Hudson-Vitale et al. 2017). However, a repository is not requisite for robust data curation services. A growing number of libraries have opted to implement data catalogs instead of or in addition to repositories to maximize discovery of their researchers' datasets.
Defined broadly, a data catalog is a curated collection of metadata records that describe and point to data products of interest. Data catalogs do not archive research datasets. Instead, they focus on increasing the discoverability of those datasets, "discoverable" being synonymous with "findable", the F in the FAIR Data Principles (Wilkinson et al. 2016). But they are not aggregators like Google Dataset Search, which harvests and displays metadata directly from the web (Noy 2020): data catalog records are created, checked, and updated by professionals (frequently data librarians) through multi-step workflows that ensure their metadata is accurate and understandable to both humans and machines. By focusing solely on curating datasets for online discovery, data catalogs support the discovery, citation, and reuse of research data.
The reasons why libraries may develop a data catalog are myriad. Motivations from the authors' experiences include:

Capacity
• Storing and preserving datasets in all their different formats and sizes is not feasible for many academic libraries that may still wish to enable access to data. Data catalogs may be more sustainable.

Uncovering datasets that are otherwise hidden
• Data catalogs can describe high-value datasets that, due to access protocols, are not available anywhere else. For example, the NYU Data Catalog provides the authoritative description and access point for the protected Neurological Emergencies Outcomes at NYU (NEON) dataset (New York University 2021).
• Data catalogs can act as the point of access for datasets whose creators are reluctant to share their data in a public repository but willing to share upon request. The conversation with authors required to describe these private datasets also affords data catalogers an opportunity to suggest improvements to the data package, like the creation of READMEs.
• They can facilitate access to secured data by describing data governance, providing access instructions, and linking to request forms.
• Data catalogs can support research transparency by providing metadata for publicly-funded sensitive data that cannot be put online in full.

Fostering collaboration in scientific and institutional communities
• Some data catalogs describe externally-created datasets that are widely used for secondary analysis (e.g., large national surveys) and add value by naming a "local expert" who is willing to act as a contact, collaborator, or mentor for institutional researchers (Read et al. 2015).
• Data catalogs can serve as a source for educational or training material, especially if an institutional author is available to answer questions about the data. The metadata fields in a data catalog can be used to parse out large datasets suitable for training algorithms, for example, or for practicing analysis using particular scripting languages.
• Data catalogs can complement community infrastructure that a library may already support. Institutional members of a community repository (e.g., the Dryad Data Repository) can continue to deposit data there while the data catalog enhances the discoverability of those data.
• Data catalogs allow datasets to be archived where they are most likely to be found and used. Data are more likely to be cited if they are archived in a disciplinary repository and indexed in multiple locations (Mannheimer, Sterman, and Borda 2016); a catalog record for a dataset hosted elsewhere contributes an additional index point while making the institution's relationship to the dataset explicit.
• Libraries can co-locate all datasets created at the institution in one institutional data catalog, which serves as an institutional data inventory or marketplace.

Complying with data governance requirements for sensitive data
• A data catalog can be set up to capture who is authorized to access confidential data according to data governance. Weill Cornell Medicine, for example, has integrated its catalog with its Data Core (secure data enclave) management system (Oxley 2020). This allows the institution to rapidly review requests for access and monitor changes in authorization for projects and datasets. By developing a data catalog that focuses on governance metadata, the organization has advanced its ethical responsibilities towards the handling of confidential data by both enhancing visibility of governance constraints and increasing the research value obtained from those datasets.
• Data catalogs can provide an audit trail for datasets containing clinical patient data. Weill Cornell's data-enclave integration tracks each dataset's initial registration, the purposes for which it is being used, who accesses it, and the conditions of users' authorization, helping to ensure patient confidentiality (Oxley et al. 2018).

Integrating with existing library and campus infrastructure
• In situations where researchers have deposited data products in multiple locations, data catalogs can pull together related material in one record. This also applies to data products hosted in campus enterprise services. A hypothetical complex record could include structured metadata describing and linking to:
  • Dataset (processed data) in the organization's institutional repository
  • Large dataset (raw data) in the organization's Globus instance or other high-volume file transfer platform
  • Analysis code in GitHub
  • Registered protocol at protocols.io
• Data catalogs provide an additional discovery layer for institutional repositories, which were viewed favorably by users in one study for their preservation functions but less so for "searchability" (Shen 2017, 120).
• Data catalog records may be indexed in a general library catalog by using Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to extend a discovery system's ability to surface datasets.
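The OAI-PMH harvesting mentioned in the last point can be sketched from the harvester's side. The following is a minimal illustration only, assuming the data catalog exposes an OAI-PMH endpoint that serves unqualified Dublin Core (`oai_dc`); the base URL, the sample response, and the function names are hypothetical and do not describe any particular catalog platform. The sketch builds a `ListRecords` request and extracts a few fields from a response without making a network call.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a (hypothetical) endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Extract identifier, title, and creators from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    return records


# Invented sample response, standing in for what a catalog endpoint might return.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:catalog.example.edu:dataset-1</identifier>
        <datestamp>2021-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example survey dataset</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:type>Dataset</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    print(list_records_url("https://catalog.example.edu/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

A production harvester would also need to follow `resumptionToken` elements for paged responses and handle OAI-PMH error codes, both of which are omitted here.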

Specifics of data curation activities within catalogs
As nexuses for discovery, data catalogs describe data products with standards-compliant metadata that creates linkages among related records and serves structured data to search engines. They are searchable and browsable, although the "how" varies according to each platform's architecture. In the Data Discovery Collaboration (DDC), to which the authors and their institutions belong (Data Discovery Collaboration 2021), defined metadata fields include but are not limited to: dataset title(s), creator(s), data and resource type, description, keywords, file size, software and equipment used to collect or analyze the data, funding information, and associated publications. Metadata entries are drawn from controlled vocabularies and URI-referenced whenever possible. Crucially, records include instructions for accessing the data, which may involve submitting an application or contacting the author. Institutions vary in the specific metadata schemas and standards implemented in their catalogs, but DataCite, DATS, Dublin Core, and schema.org are represented among the catalogs in the DDC.
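To make "serving structured data to search engines" concrete, the sketch below maps a simplified catalog record to a schema.org `Dataset` description in JSON-LD. It is an illustration under stated assumptions, not the schema of any DDC catalog: the internal record layout, the field names, and the sample values are invented, and the mapping covers only a handful of the fields listed above. The schema.org properties used (`name`, `description`, `creator`, `keywords`, `conditionsOfAccess`, `citation`) are part of the published vocabulary.

```python
import json


def to_schema_org(record):
    """Map a simplified (hypothetical) catalog record to schema.org Dataset JSON-LD."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["description"],
        "creator": [{"@type": "Person", "name": name} for name in record["creators"]],
        "keywords": record["keywords"],
    }
    # Access instructions are what distinguish a catalog record from a bare
    # citation, so carry them over whenever they are present.
    if record.get("access_instructions"):
        doc["conditionsOfAccess"] = record["access_instructions"]
    if record.get("publications"):
        doc["citation"] = record["publications"]
    return doc


# Invented example record; none of these values come from a real catalog.
example_record = {
    "title": "Example clinical outcomes dataset",
    "description": "De-identified outcomes data collected for an example study.",
    "creators": ["Jane Doe", "John Roe"],
    "keywords": ["outcomes", "neurology"],
    "access_instructions": "Submit a data use application to the study team.",
    "publications": ["Doe J, Roe J. Example article. Example Journal. 2020."],
}

if __name__ == "__main__":
    # The resulting JSON would typically be embedded in the record's landing
    # page inside a <script type="application/ld+json"> element.
    print(json.dumps(to_schema_org(example_record), indent=2))
```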
The focused scope of a data catalog allows data catalogers to act nimbly, beginning with identifying data to describe. Information about a dataset that is a candidate for cataloging may enter an institution's workflow in any of three general ways (Ratajeski et al. 2019):
a) Submitted by the dataset creator, either because the researcher discovered the catalog independently or because they were solicited by a cataloger (e.g., through a "cold email" asking whether they had any datasets they would be willing to share, which may uncover datasets otherwise unavailable).
b) Identified and cataloged by a cataloger, with varying degrees of participation from the dataset creator. A dataset deposited to a major repository with full documentation may require very little, if any, further information from the author, while a dataset identified only through a paper's "data available upon request" statement may require much more communication.
c) Automated by using strategies such as harvesting data repository APIs, then enhanced by a data cataloger. Examples include Montana State University's Dataset Search, which finds datasets by institutional authors in external data repositories and presents them to data catalogers for manual review and record creation (Mannheimer et al. 2021).
Path a) resembles the workflow of many repositories, particularly large-scale and domain-nonspecific repositories, where staff may promote their repository's services but rarely make specific collection requests. Paths b) and c), in contrast, take a proactive approach to discovering datasets. Path b) casts the data cataloger in the role of the eventual data consumer, faced with the task of finding relevant data like the proverbial needle in a haystack, but with expert searching skills to help. Path c) lessens that burden with automation, providing the cataloger with a list of likely candidates (of datasets, publications with data availability statements, or simply known data-producing authors) winnowed from systems like faculty information systems, REDCap reports, or PubMed article alerts.
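The winnowing step in path c) can be sketched as a simple filter over harvested records. This is a hypothetical illustration, not the logic of Montana State University's Dataset Search or any other named system: it assumes each harvested item carries a DOI and a list of author affiliation strings, and it keeps only items that mention the institution and are not already cataloged. All names and sample data are invented.

```python
def normalize_doi(doi):
    """Reduce common DOI spellings to a bare, lowercase form for comparison."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def winnow_candidates(harvested, cataloged_dois, institution):
    """Return harvested items worth a cataloger's manual review.

    An item is kept when at least one affiliation mentions the institution,
    its DOI is not already in the catalog, and it has not been seen earlier
    in the same harvest.
    """
    known = {normalize_doi(d) for d in cataloged_dois}
    seen = set()
    candidates = []
    for item in harvested:
        doi = normalize_doi(item["doi"])
        affiliated = any(
            institution.lower() in affiliation.lower()
            for affiliation in item["affiliations"]
        )
        if affiliated and doi not in known and doi not in seen:
            seen.add(doi)
            candidates.append(item)
    return candidates


# Invented harvest results standing in for a repository API response.
harvested = [
    {"doi": "https://doi.org/10.1234/AAA", "title": "A",
     "affiliations": ["Example University"]},
    {"doi": "10.1234/bbb", "title": "B",
     "affiliations": ["Other College"]},
    {"doi": "doi:10.1234/ccc", "title": "C",
     "affiliations": ["Dept. of Biology, Example University"]},
    {"doi": "10.1234/aaa", "title": "A (duplicate)",
     "affiliations": ["Example University"]},
]

if __name__ == "__main__":
    for candidate in winnow_candidates(harvested, ["10.1234/CCC"], "Example University"):
        print(candidate["title"], candidate["doi"])
```

Substring matching on affiliation strings is deliberately crude; the candidates it produces still go to a data cataloger for review, which is the human step the path c) description retains.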
An examination of a data catalog's activities using the Data Curation Network's C-U-R-A-T-E-D framework (Data Curation Network 2018) will illustrate where the catalog's curation energies go. In this framework, each of the letters in the word 'CURATED' stands for a step in the DCN's data curation process. Many of the steps are analogous to steps in the data catalog curation process, with a key difference in focus. Data Curation Network stewards aim to improve the quality of a dataset and its accompanying metadata for submission to a repository. Data catalog stewards focus almost entirely on improving or creating entirely new metadata, as "hidden" datasets (e.g., available only upon request from the author) rarely have author-supplied metadata already.
Note that the details of each step below will vary among data catalog-maintaining institutions. Note too that since many data catalogers are data librarians, they may also have separate conversations with researchers about their datasets' quality, especially if they are submitting data to a repository to which a data catalog record might then link.

Setting standards for data catalogs
Traditional card catalogs and their online public access catalog successors, aiding in the discovery of individual items in a collection, have been a mainstay of libraries for over a century. Libraries that support data catalogs, however, are relatively new and few in number. In 2017, several academic health science libraries organized into a loose network called the Data Catalog Collaboration Project (DCCP) to offer a support community for those working toward the shared goal of enhancing the findability of datasets. Each member institution (some with funding from the Network of the National Library of Medicine) indexed its biomedical research data with a local instance of an open-source data catalog platform created at the founding member institution, New York University (Lamb and Larson 2016). The DCCP brought a cross-institutional perspective to addressing usability, data sharing workflows, metadata, and outreach to improve data discovery.
At a February 2020 retreat, current and potential institutional members met to reassess the intent and priorities of the group. Among the outcomes were a new name, the Data Discovery Collaboration (DDC); the creation of a steering committee; inclusion of non-health sciences institutions; and a purposeful shift towards a broader, platform-agnostic approach to data discovery that would be focused on developing standards and best practices. The Mission Statement of this reimagined organization reads: "To enhance discovery of data and other research products in order to maximize their value" (Data Discovery Collaboration 2021).
In its new iteration, the DDC enables data discovery in its broadest forms through a governance structure designed to encourage participation from member organizations through working groups, listserv discussions, and Slack channel conversation. Data catalogs are no longer a requirement for membership, but they remain a central topic in addition to metadata creation, interoperability, and code sharing. Table 2 summarizes the current member institutions and their data catalog platforms.
Table 2: DDC members and their data catalogs as of June 2021
The Data Discovery Collaboration welcomes involvement with both potential new members (individuals and institutions) and unaffiliated organizations who share its goal of supporting data reuse by increasing discoverability of data products. The collaboration's core groups are currently working on projects such as building out metadata elements for basic/bench science data; packaging the NYU Data Catalog code into a Docker container for easier installation; recruiting an advisory board to help the DDC navigate the technological and social aspects of data discovery; and sharing strategies for promoting data sharing and reuse within our institutions. To join the conversation, please contact any of the authors or visit the Data Discovery Collaboration website (https://datadiscoverycollaboration.org).
Though less well-known than their repository relatives, data catalogs are a powerful tool for curating research data in the library. Their focus on data discovery makes them ideal candidates for a wide range of settings and purposes, as shown by just some of the diverse use cases presented in the Data Discovery