Full-Length Paper

Discovery and Reuse of Open Datasets: An Exploratory Study

Objective: This article analyzes twenty cited or downloaded datasets and the repositories that house them, in order to produce insights that can be used by academic libraries to encourage discovery and reuse of research data in institutional repositories.

Methods: Using Thomson Reuters’ Data Citation Index and repository download statistics, we identified twenty cited/downloaded datasets. We documented the characteristics of the cited/downloaded datasets and their corresponding repositories in a self-designed rubric. The rubric includes six major categories: basic information; funding agency and journal information; linking and sharing; factors to encourage reuse; repository characteristics; and data description.

Results: Our small-scale study suggests that cited/downloaded datasets generally comply with basic recommendations for facilitating reuse: data are documented well; formatted for use with a variety of software; and shared in established, open access repositories. Three significant factors also appear to contribute to dataset discovery: publishing in discipline-specific repositories; indexing in more than one location on the web; and using persistent identifiers. The cited/downloaded datasets in our analysis came from a few specific disciplines, and tended to be funded by agencies with data publication mandates.

Conclusions: The results of this exploratory research provide insights that can inform academic librarians as they work to encourage discovery and reuse of institutional datasets. Our analysis also suggests areas in which academic librarians can target open data advocacy in their communities in order to begin to build open data success stories that will fuel future advocacy efforts.


Introduction and Background
A fundamental role of libraries is that of information access provider, and at its core, data discovery is simply a form of information access. The expertise developed in libraries is therefore applicable to data discoverability, with traditional cataloging and archiving skills closely paralleling the skills required to curate and preserve data. Building from this foundation of information access, libraries are well equipped to suggest data description practices and repository features that will encourage discovery and reuse (Witt, Carlson, Brandt, and Cragin 2009; Wallis, Mayernik, Borgman, and Pepe 2010; Faniel, Minor, and Palmer 2014). In this article, we analyze twenty cited or downloaded datasets and the repositories that house them, in order to produce insights that can be used by academic libraries to encourage discovery and reuse of research data in institutional repositories.
Todd Vision describes data as "a classic example of a public good, in that shared data do not diminish in value" (2010, 330). This sentiment is a guiding tenet of the Open Data movement, which aims to make research data freely and publicly available. The movement has been strengthened in the United States by two recent policy developments: first, major funding agencies have begun to require data management plans (NIH 2003; NSF 2011), and second, several prominent journals (Dryad 2011; PLOS 2014) and the Office of Science and Technology Policy (Holdren 2013) have issued policies requiring that supporting data be published alongside associated articles. These policies encourage the practice of open data publishing for two key reasons. First, open data supports reproduction and validation of research (Santer, Wigley, and Taylor 2011; Lutter, Barrow, Borgert, Conrad, Edwards, and Felsot 2012). Second, open data encourages the repurposing of research data in order to promote new discoveries and advance science (Kelder 2005; Faniel and Jacobsen 2010).
The ascendency of the Open Data movement has resulted in a growing number of repositories providing access to research data in addition to publications. The Registry of Research Data Repositories currently includes over eight hundred repositories run by institutions in the United States and over fifteen hundred repositories worldwide (Registry of Research Data Repositories 2016). Discipline-specific repositories like National Center for Biotechnology Information and Worldwide Protein Data Bank facilitate disciplinary data sharing, while general-purpose repositories like Dryad and Figshare attempt to fill the gaps by housing data from a range of disciplines. These general data are often "long tail data," described by Wallis, Rolando, and Borgman as tending to be "small in volume, local in character, intended for use only by [the research team], and less likely to be structured in ways that allow data to be transferred easily between teams or individuals" (2013). Institutional repositories in academic libraries, initially built to provide open access to publications (Crow 2002; Lynch 2003), are a natural fit for institutionally produced research data, and especially long tail data that may not fit the scope of a discipline-specific repository.
Data sharing culture varies between scientific disciplines. Strong cultures of data sharing exist in geophysics, molecular biology, and ecology (Nelson 2009). Social scientists and medical researchers, who often produce human subject data or other sensitive data that require more effort to prepare for publication, are less likely to share their research data online (Tenopir et al. 2011). Although the practice of open data sharing is on the rise, the literature has yet to clearly demonstrate whether published datasets are being discovered and reused. As Wallis, Rolando, and Borgman (2013) inquire, "if we share data, will anyone use them?" Publishing data openly is only the first step toward successful data sharing. To realize the goals of the Open Data movement, published datasets must be discoverable and reusable.
Promoting discovery of datasets is a complex process. Researchers have traditionally found data by reading published literature, talking with professional peers, or searching trusted data repositories (Zimmerman 2007). Recent work has also explored semantic web applications to promote web-scale discovery of open access repository resources through implementation of the Resource Description Framework (RDF) and schema.org metadata (Latif, Borst, and Tochtermann 2014). However, implementation of RDF and schema.org metadata has only been preliminarily explored in the specific context of data repositories (Rosati and Mayernik 2013). Beyond discoverability, once a researcher finds an applicable dataset, the data must also be reusable. White et al. (2013) suggest three strategies to encourage data reuse: (1) document data well; (2) format data for use with a variety of software; (3) share data in established repositories with open licenses. In this article, we aim to identify common characteristics of cited/downloaded datasets and their repositories. We propose that these common characteristics can be used to provide insights for academic librarians looking to increase discovery and reuse of datasets published in institutional data repositories.

Methods
Measuring reuse of datasets is a difficult endeavor. Researchers have used several different methods to attempt to track reuse. Piwowar, Carlson, and Vision (2012) searched Google Scholar for the accession number, DOI, and journal name for 100 datasets, in order to find studies that mention dataset reuse in the text of the article. Chao (2012) identified the affiliated publications for datasets in order to take advantage of more traditional bibliometric measurement methods. In a smaller-scale study, Belter (2014) used a combination of Web of Science, full-text search capabilities provided by journal publishers' websites, and Google Scholar in order to measure reuse of oceanographic datasets.
In order to identify datasets that have been discovered and reused, our research team opted to use data citation counts and data download counts. We consider citation count to be a more accurate measure of reuse than download count because citations are proof of use, whereas downloads simply hint at use. As Konkiel writes, "we cannot be sure if downloads mean that the dataset has been used in any way, just as we cannot be sure that downloads of journal articles guarantee a paper has been read" (Konkiel 2013). Consequently, citations are our preferred metric of measurement. However, we also consider download counts to be a useful metric for measuring reuse, partly due to the results of a 2015 survey conducted by Kratz and Strasser. The authors write: "We asked what metrics researchers would most respect when evaluating a dataset's impact. Respondents considered number of citations to be the most useful metric; 49% (n = 119) found citation count highly or extremely useful. Unexpectedly, a substantial 32% (n = 77) felt the same way about number of downloads" (Kratz and Strasser 2015). While download counts are a less concrete measure of reuse than citations, this survey result indicates that researchers believe that download metrics can reflect reuse.
Tracking downloads in addition to citations in this study was also necessary from a practical standpoint. Including datasets published in institutional repositories was important to gain a broader understanding of dataset characteristics that may influence discovery and reuse. Since few citation statistics were available for institutional data repositories, we were compelled to include downloads as a measure of dataset reuse for these repositories.
We used Thomson Reuters' Data Citation Index (DCI) (Thomson Reuters 2016a) to identify cited datasets. The DCI is a subscription-based database on the Web of Science that indexes data repositories and reports the number of articles that cite individual datasets. In order to index data repositories, the DCI requires that the repositories be "demonstrably active, whether by continued maintenance and curation of the data sets held, or by addition of new materials, evidenced by data deposition statistics" (Force and Robinson 2014). When choosing data repositories to index, the DCI also looks for robust metadata, evidence of repository persistence, funding statements, peer review, and links between datasets and the research literature. The DCI continually monitors the repositories it indexes for availability, quality, and relevance to the DCI. The DCI does not track citations itself, but rather aggregates this information as collected by data repositories (Thomson Reuters 2016b).
In order to select datasets for our analysis, we assumed that more data has been published and reused in recent years, due to data archiving mandates from academic journals and funding agencies. In order to provide a sufficient amount of time for datasets to be discovered, used, and cited, we limited our results to data published in 2013 (three years prior to this study). It is important to note that since our search was limited to datasets published in 2013 and indexed in the DCI, each dataset chosen for analysis in this paper may not be the highest-cited dataset in its repository; it is merely the highest-cited dataset that was published in 2013 and shows citations in the DCI. Our initial search returned 763,057 cited datasets. Since the DCI limits the amount of data a user can extract to blocks of five hundred cited datasets, we downloaded the top-cited one thousand datasets. From these one thousand datasets, we chose the fourteen repositories with the highest median citations per dataset in the DCI (see Figure 1). We then conducted our exploratory analysis using the top-cited dataset from each of these fourteen repositories.
Among the repositories with the highest median citations in the DCI in 2013, the number of citations drops quickly from a median of eight citations per dataset in the Australian Antarctic Data Centre to a median of one citation per dataset in the Animal QTL Database, as illustrated by Figure 1.
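The selection procedure described above (rank repositories by median citations per dataset, then take the top-cited dataset from each) can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure the authors actually ran: the function name `select_top_cited`, the tuple layout of the records, and the sample values are all hypothetical, and the actual DCI export format is not described in this article.

```python
from collections import defaultdict
from statistics import median


def select_top_cited(records, n_repositories=14):
    """Pick the top-cited dataset from each of the highest-ranked repositories.

    `records` is an iterable of (repository, dataset_id, citation_count)
    tuples. This layout is an assumption made for illustration.
    """
    by_repo = defaultdict(list)
    for repo, dataset_id, citations in records:
        by_repo[repo].append((dataset_id, citations))

    # Rank repositories by the median citation count of their datasets.
    ranked = sorted(
        by_repo,
        key=lambda repo: median(c for _, c in by_repo[repo]),
        reverse=True,
    )[:n_repositories]

    # Keep the single most-cited dataset from each selected repository.
    return {repo: max(by_repo[repo], key=lambda d: d[1]) for repo in ranked}


# Hypothetical sample data, not taken from the study.
sample = [
    ("Repo A", "a1", 8), ("Repo A", "a2", 10),
    ("Repo B", "b1", 1), ("Repo B", "b2", 3),
    ("Repo C", "c1", 1),
]
selected = select_top_cited(sample, n_repositories=2)
```

With the sample above, the two repositories with the highest medians are kept and each contributes its single most-cited dataset. Ranking by median rather than by total citations keeps one heavily cited dataset from dominating the repository ranking.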
Since there were no institutional data repositories with citations reported in the DCI for 2013, we produced a convenience sample of six Digital Library Federation member institutions. If the institutional data repository indicated a most-downloaded or most-cited dataset, we used that dataset for our analysis. If no repository-wide download statistics were available, we selected a highly downloaded dataset.
The datasets in our final twenty results reflect either citations in the DCI or a high number of downloads in an institutional repository. We documented the characteristics of the cited/downloaded datasets and their corresponding repositories by reviewing publicly available information on repository websites and inputting our observations into a self-designed rubric. The rubric addresses the characteristics of cited/downloaded datasets and their repositories by grouping them into six major categories: basic information; funding agency and journal information; linking and sharing; factors to encourage reuse; repository characteristics; and data description (see Appendix A for blank rubric; the completed rubric is available from Montana State University ScholarWorks, http://doi.org/10.15788/m2059z). The rubric allowed us to identify common characteristics of cited/downloaded datasets.
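To illustrate how rubric observations can be tallied into the counts and percentages reported in the Results, the following Python sketch records a few yes/no characteristics per dataset and computes the share of datasets that exhibit each one. The class `RubricEntry`, its field names, and the helper `share` are hypothetical and cover only a small subset of the rubric. The totals (12, 20, and 4 out of 20) match figures reported in this article, but the assignment of values to individual placeholder datasets is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class RubricEntry:
    """One row of a simplified, hypothetical rubric (not the full instrument)."""
    dataset: str
    has_persistent_identifier: bool
    indexed_in_multiple_locations: bool
    has_explicit_license: bool


def share(entries, field_name):
    """Return (count, total, percent) of entries where `field_name` is true."""
    count = sum(1 for entry in entries if getattr(entry, field_name))
    total = len(entries)
    return count, total, round(100 * count / total)


# Placeholder rows: totals follow the article, per-dataset values are arbitrary.
entries = [
    RubricEntry(
        dataset=f"dataset-{i}",
        has_persistent_identifier=True,
        indexed_in_multiple_locations=i < 12,
        has_explicit_license=i < 4,
    )
    for i in range(20)
]

indexed = share(entries, "indexed_in_multiple_locations")
identified = share(entries, "has_persistent_identifier")
licensed = share(entries, "has_explicit_license")
```

Keeping each characteristic as a separate boolean field makes it straightforward to report every result in the same "percent (count/total)" form used throughout the Results section.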

Results
From our sample of twenty cited/downloaded datasets and their corresponding repositories, we identified the following characteristics from which we can gain insight into factors that may encourage discovery and reuse.
Our analysis reveals that the cited/downloaded datasets in our sample generally comply with the basic recommendations for facilitating reuse outlined by White et al. (2013). Beyond these best practices for reuse, we identify additional factors that appear to influence dataset discovery (see Table 1).
• 60% (12/20) of the datasets analyzed are indexed in more than one location on the web. For example, the most-cited dataset in our results is available from the Australian Antarctic Data Centre; additionally, metadata and a link to the dataset are available from the Global Change Master Directory.
• A persistent identifier also appeared to influence discovery and reuse; all (20/20) of these cited/downloaded datasets have a persistent identifier, eight of which are Digital Object Identifiers (DOIs).
• Data mandates also appear to contribute to citations and downloads; of the fifteen datasets that disclose an external funding source, nine (60%) are funded by agencies that require data archiving.

The cited/downloaded datasets in our results can be grouped into five broad disciplines: Climate Science, Ecology, & Environmental Science; Genetics, Genomics, & Evolution; Chemistry; Biochemistry & Molecular Biology; and Engineering (see Figure 2). This finding reinforces existing research showing that some disciplinary cultures support data sharing and reuse more than others (Nelson 2009; Tenopir et al. 2011). This culture of reuse extends to the creation of discipline-specific repositories in these disciplines. If data repositories are established elements of the disciplinary research ecosystem, researchers are more likely to discover and reuse data from those repositories, regardless of metadata, file type, or other factors.
This finding suggests that datasets are most easily discoverable in discipline-specific repositories. It seems to follow that libraries should recommend that researchers deposit in major disciplinary data repositories. Unfortunately, our research showed that, of the fourteen disciplinary repositories in our study, only two (~14%) had preservation policies. This stands in contrast to the institutional repositories in our study, four out of six (~67%) of which had preservation policies (see Figure 3). When we revisited the datasets analyzed in our rubric five months after initial data collection, two out of the fourteen datasets in discipline-specific repositories (~14%) were unavailable online. On January 24, 2016, the Treebase Repository produced a 502 error, and the Animal QTL Database reported that the persistent identifier for the cited dataset in our analysis could not be found. While this trend suggests a conflict between discovery and preservation, our small sample size of 20 repositories limited the scope of our results; a larger study would allow for more conclusive results. Still, a key library mission is to ensure long-term preservation of information. Since the value of research datasets will persist, preservation is an important consideration. Librarians should carefully evaluate discipline-specific repositories through the lens of preservation before making recommendations to researchers.

Figure 2: Academic Disciplines
Upon closer examination, two seemingly significant characteristics (whether a repository provides a suggested data citation, and whether a dataset underlies an associated research publication) are less clearly related to discovery and reuse. While 55% (11/20) of cited/downloaded datasets are published in repositories that offer a suggested citation, 20% (4/20) of the repositories we analyzed suggested citing the associated publication, rather than providing a suggested citation specific to the dataset itself. We produced a similarly unclear result when analyzing whether cited/downloaded datasets are associated with a specific research publication. While 80% (16/20) of the cited/downloaded datasets in this study are associated with a publication, only 37.5% (6/16) of those publications cite or link to the associated dataset. Past research has suggested that researchers find data by reading published articles (Zimmerman 2007); this practice does not appear to be reflected in our research. However, while the absence of data references in the associated publications analyzed here is an interesting result, our sample size is too small to seriously call into question whether an associated publication directly leads to dataset discovery.

Discussion
The results of this exploratory study generated insights that may help academic libraries encourage discovery and reuse of institutional datasets.
From our analysis, it appears that the following factors may facilitate dataset reuse:
• Robust data description
• Non-proprietary file types
• Publication in open access repositories

Many libraries already provide guidance in these areas. Our research suggests that extending and expanding these services would be beneficial.

The following factors also appear to contribute to dataset discovery:
• Publication in prominent, discipline-specific repositories (after evaluating for sustainability and preservation activities)
• Cross-indexing between institutional data repositories, discipline-specific repositories, and discipline-specific metadata catalogs

Finally, our analysis suggests that cited/downloaded datasets are:
• Funded by agencies that require data publication
• Produced by researchers in a few specific disciplines

These findings suggest that academic librarians may be able to target open data advocacy in their communities, directly soliciting datasets from certain disciplines and from grant awardees whose funders require data publication. By providing these targeted services, libraries are well positioned to encourage discovery and reuse, and to begin to build open data success stories that will fuel future advocacy efforts.

Limitations and Future Directions
This exploratory research produced promising insights into the factors that influence data discovery and reuse. We note a number of limitations to our study. Our exploratory approach included a small sample of datasets (n = 20), including a convenience sample of institutional data repositories. Also, the Data Citation Index is an imperfect tool to measure data citations. We note three major limitations related to our use of the DCI to conduct this research. First, the DCI relies on direct reporting from repositories. This limits our discipline-specific repository results to data citations that have been reported to the DCI. Second, the DCI did not report citations for data in institutional data repositories for 2013. In order to include institutional data repositories in our sample, we had to extend our reuse metrics to download statistics, an even less clear-cut measure of reuse. Third, for several of the datasets with a single citation reflected in the DCI, the dataset and the citing article are created by the same author. Lastly, our study was limited by an unfortunate reality: in the data sharing community, there is an absence of standard data citation practices or other data reuse metrics (Parsons, Duerr, and Minster 2010). While the DCI evaluates indexed repositories for overall quality (Thomson Reuters 2016b), it does not evaluate the accuracy of each repository's data citation tracking methods. In order to gauge the efficacy of the strategies suggested by this research, and in order to conduct future, more conclusive research, we must first be able to reliably measure data reuse.
A 2011 editorial by Michael Whitlock suggests that scientists who reuse data could go so far as to offer co-authorship to original data creators. "At the very least," he writes, "whenever data are reused, researchers must cite not only the paper or papers in which they were originally described, but also the data package itself" (Whitlock 2011, 63). Proper attribution is also an important incentive for data sharing. Tenopir et al.'s 2011 study found that one of the most important conditions that scientists had for sharing data was that they receive proper citation credit from those who use the data. The Digital Curation Centre recommends that data be cited in the manner of traditional publications, as entries in an article's reference list (Ball and Duke 2015). However, this recommendation has yet to receive full uptake from the scholarly community: datasets rarely receive traditional citations (Sieber and Trumbo 1995; Mooney and Newton 2012; Robinson-Garcia, Jiménez-Contreras, and Torres-Salinas 2015). In 2010, a preliminary study found that only 33% of repositories, 6% of journals, and 0.02% of funders suggested a best practice for data citation (Enriquez et al. 2010). Even if datasets are cited, these citations are often difficult to track, due to the lack of established practices for documenting data reuse (Mayernik 2013).
Progress is being made to promote data citation practice. DOIs provide a single, persistent identifier that can be used for citation and access of datasets (Simons 2012). The DataCite project further facilitates discovery and machine-readability by providing dataset DOIs that include an underlying dataset-specific XML metadata scheme (Starr and Gastl 2011). CODATA-ICSTI released reports on the landscape of data citation in 2012 (National Research Council 2012) and 2013 (CODATA-ICSTI Task Group 2013), and a cooperative working group released the Force11 Joint Declaration of Data Citation Principles in 2014, stating that "data should be considered legitimate, citable products of research" (Data Citation Synthesis Group 2014). The Making Data Count project (Making Data Count 2016) further examines and encourages data citation practices and data reuse metrics. As these initiatives help establish standards for data citation, data reuse will become easier to track. With better data citation tracking, more robust conclusions can be reached regarding how to support the discovery and reuse of datasets.

Conclusion
The Open Data movement aims to encourage data availability for the purpose of discovery and reuse. Through analysis of cited/downloaded datasets and their corresponding repositories, the exploratory research described in this paper reveals two complementary insights:
• The common characteristics of cited/downloaded datasets and their corresponding repositories can provide direction for librarians looking to facilitate discovery and reuse of datasets published in institutional data repositories.

• The data are documented well. 95% (19/20) of the datasets analyzed have a readme file or extensive metadata.
• The data are shared in established repositories with open licenses. All twenty datasets are published in established repositories, and all twenty datasets are openly accessible. However, only four out of the twenty datasets have explicit licenses, all of which are Creative Commons Licenses: three are licensed CC BY, and one is placed in the public domain using CC 0.

Table 1: Characteristics of Cited/Downloaded Datasets


• Persistent identifiers, especially DOIs

This research suggests that datasets published in discipline-specific repositories may be more discoverable. However, discovery is only one part of the open data equation; librarians should carefully evaluate repositories' preservation activities before making recommendations. Once datasets are deposited and published in trustworthy discipline-specific repositories, institutional data repositories can provide metadata records for these datasets in order to further encourage discovery. Correspondingly, libraries can request that datasets published in the institutional data repository be indexed in appropriate discipline-specific data repositories or catalogs. DOIs and other persistent identifiers also appear to facilitate discovery and reuse.
The fact that these datasets are indexed in the DCI suggests better discoverability; however, while the single citation reflected in the DCI for these datasets does indicate use, it may not indicate reuse.Our decision to search the DCI only for datasets published in 2013 likely affected the number of citations per dataset.Future research should investigate all datasets in the DCI, in order to better identify datasets with high numbers of citations.