Introduction
Data discoverability (Findability) is a core principle of open scholarship and the FAIR principles (Wilkinson et al. 2016). An optimally findable dataset is discoverable by humans and machines without prior knowledge of its existence. Data should also be findable by entities other than researchers, as data reuse can occur outside of the research ecosystem (e.g., Wallis et al. 2006; Tenopir et al. 2011; Borgman 2012; Gregory 2020; Gregory et al. 2020; Sun et al. 2024). For example, the Generalist Repository Ecosystem Initiative (GREI) recognizes four use cases (Staller et al. 2023; Van Gulick et al. 2024): (1) a researcher sharing data; (2) a researcher searching for data; (3) a funder tracking compliance and impact; and (4) an institution tracking compliance and impact.
Various approaches to discover affiliated research datasets have been developed (e.g., Barsky et al. 2016; Lafia and Kuhn 2018; Mannheimer et al. 2021; Sheridan et al. 2021; Van Wettere 2021; Briney 2023; Alawiye and Kirsch 2024; Dellureficio and Pokrzywa 2024; Johnston et al. 2024; Wang 2024; Wink 2024; Lostner and Krzton 2025; Warner 2025). These efforts have identified as many obstacles as they have identified solutions. Johnston et al. (2024), a study conducted through the Realities of Academic Data Sharing (RADS) initiative, recently summarized existing barriers to institutional discovery. These include variation in: (1) recording and crosswalking of metadata; (2) utilization of persistent identifiers (PIDs) such as ORCID (Open Researcher and Contributor ID) and ROR (Research Organization Registry); (3) selection of minimum metadata fields; (4) quality control measures (e.g., curation); and (5) linkages to related objects (e.g., articles).
Commercial solutions that are either designed for research data (e.g., Elsevier’s Data Monitor) or that have been expanded from other scopes (e.g., Clarivate’s Web of Science) are widely used. However, proprietary solutions run counter to open scholarship principles due to subscription fees and revenue-driven decision-making; limited interoperability; closed-source processes that make it difficult to assess accuracy and completeness of retrievals; and biases in inclusion of content. These are well-documented for platforms designed for articles (e.g., Vieira and Gomes 2009; Franceschini et al. 2016; Mongeon and Paul-Hus 2016; Vera-Baceta et al. 2019; Zhu and Liu 2020; Pranckutė 2021; Visser et al. 2021) and for datasets (e.g., Benjelloun et al. 2020; Chapman et al. 2020; Sostek et al. 2024). For institutions, limiting subscription dependency is also important for ensuring financial sustainability. The sunsetting of Data Monitor (closed-source, subscription-based; Elsevier 2025) due to insufficient uptake further underscores the risks of commercial solutions.
Open-source solutions offer an alternative. Open-source software (OSS) is widely deployed in academic and non-academic settings, from QGIS (QGIS 2025), an alternative to ArcGIS Pro for geospatial data, to the Dataverse Project (Crosas 2011), a data repository framework. OSS is not without potential shortcomings, including the frequent need to develop and maintain it without substantial resources, but there are also significant advantages, including transparent processes; opportunity for broader community engagement; flexibility in design; avoidance of subscription dependencies; and the ability for others to freely access, reuse, and repurpose the software. Open-source solutions also exemplify best practices for research data and software management that digital scholars promote to researchers: that metadata, data, and software should be maximally FAIR.
This paper describes an open-source Python workflow (Gee 2025a) for discovering affiliated research data publications. It was developed in the context of The University of Texas at Austin (‘UT Austin’) but can be easily adapted for other institutions. Through a combination of broad queries that account for variation in the construction and location of affiliation metadata and narrow queries that target specific repositories (e.g., Figshare) or modalities of data sharing (e.g., absence of affiliation metadata), the composite workflow has higher coverage potential than previous OSS workflows (e.g., Johnston et al. 2024) and is intended for further development (https://github.com/utlibraries/research-data-discovery). The paper presents the conceptual framework and the challenges encountered, offering an overview that is agnostic to software language.
Description of workflow
The focus of this paper is the workflow, rather than a resultant static snapshot, so descriptions of processes and results are interwoven. The workflow is presented in a coherent fashion by virtue of scholarly communication norms, but it was developed through extensive trial-and-error and patchwork exploration (Appendix A; Gee 2025c), and certain components related to unexpected metadata and repository nuances were informed by the author’s combined experience as a former data curator at Dryad, a current data repository manager at UT Austin, and an active scientific researcher.
The workflow comprises five major search processes and two major cleaning processes (Fig. 1). In brief, the seven steps currently in the workflow are:
1. the primary search step involving a query to the DataCite API for ‘dataset’ records that contain one of numerous permutations of an institution’s name in one of four metadata fields;
2. an (optional) cross-validation step involving affiliation-based queries to select data repositories’ APIs to identify records that were not retrieved from DataCite;
3. a deduplication cleaning step to handle multiple DOIs for one deposit (e.g., Zenodo);
4. a consolidation cleaning step to handle multiple deposits, in the same repository, which supplement the same publication (e.g., Figshare);
5. an (optional) search step for Figshare deposits that were automatically created through integrations with partner publishers, usually without affiliation metadata, which can be linked to an institution by connecting these deposits to related publications that do have affiliation metadata;
6. a query to the NCBI (National Center for Biotechnology Information) Entrez system for BioProject deposits that are institutionally affiliated; and
7. a query to the Crossref API for ‘dataset’ records that contain some or all of the words in an institutional name in the affiliation field.
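Step 1 can be sketched as a single parameterized request. This is a minimal illustration that assumes the DataCite REST API accepts an Elasticsearch-style `query` parameter; the permutation list is a placeholder subset rather than the workflow’s full set of 35, and the field names follow those named in the text:

```python
import requests

# Illustrative subset of the institutional name permutations (the full
# workflow queries 35 permutations).
PERMUTATIONS = [
    "University of Texas at Austin",
    "University of Texas, Austin",
    "UT Austin",
]

# The four DataCite metadata fields searched in Step 1.
FIELDS = [
    "creators.affiliation",
    "contributors.affiliation",
    "creators.name",
    "contributors.name",
]


def build_query(permutations, fields):
    """OR-join quoted permutations across all target fields."""
    return " OR ".join(f'{f}:"{p}"' for f in fields for p in permutations)


def fetch_dataset_page(query, page_size=100):
    """Retrieve one page of 'dataset' records from the DataCite REST API."""
    resp = requests.get(
        "https://api.datacite.org/dois",
        params={
            "query": query,
            "resource-type-id": "dataset",
            "page[size]": page_size,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["data"]


query = build_query(PERMUTATIONS, FIELDS)
```

In practice, results are paginated, so `fetch_dataset_page` would be called in a loop until all pages are retrieved.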
Workflow validation
The trial-and-error nature of this workflow’s development involved iterative exploration and validation of different steps. Some of this is explicit in the codebase (Gee 2025a) and integrated into the workflow description below (e.g., cross-validation of query results from DataCite’s API for select repositories against query results from those repositories’ own APIs facilitated identification of additional institutional permutations to query in DataCite’s API). In other instances, validation processes are less obvious because they were not scripted (e.g., manually comparing search results from an API against those from a web interface; examining DataCite metadata for specific deposits that are known to be affiliated or that I have published myself). Owing to space constraints on article length, additional details are provided in Appendix B (Gee 2025c).
Conceptual overview
Search approach
There are two primary avenues to institutional discovery. The first avenue is querying author names from a registry of affiliated researchers (e.g., Mannheimer et al. 2021). Limitations include: (1) the (in)ability to access personnel data and to manage such registries; (2) difficulty disambiguating records authored by researchers with common names; (3) difficulty tracking records across all organizational tiers (e.g., not all affiliated outputs include a faculty author); and (4) difficulty disambiguating work that may have been done at another institution by a presently affiliated researcher.
The second avenue is querying affiliation metadata, searching for a particular institution (e.g., Johnston et al. 2024, hereafter ‘the RADS study’). Limitations include: (1) variation in name construction (e.g., ‘UT Austin’ vs. ‘University of Texas at Austin’); (2) variation in inclusion of the parent institution in the affiliation (e.g., a researcher may list only a subsidiary unit like a medical school); (3) sensitivity of search systems (e.g., exact string match systems cannot detect a string with minor deviations from a query); and (4) variation in whether affiliation metadata are recorded and crosswalked into a schema (affiliation is not a required field in Crossref or DataCite). This workflow uses the affiliation-based method.
A third approach that warrants brief mention is examination of affiliated scholarly outputs (e.g., Briney 2023, 2024). This approach is resource-intensive because it requires maintaining a comprehensive set of records for text-mining, and it can be further impeded when dataset linkages are not in intuitive locations like data availability statements or when data are not shared through links (e.g., journal-hosted supplemental information [SI]). Many institutions lack sufficient resources for this approach, but it could become more viable if institutions required deposition of copies of scholarly outputs in an institutional repository.
Utilized data sources
The primary data source is the publicly available DataCite REST API. The Crossref REST API is unlikely to be a major data source unless institutions have minted Crossref DOIs for data repositories (e.g., Johnston et al. 2024; UT Austin has not). For UT Austin, the Crossref API returned mostly records that are either not affiliated or that are incorrectly classified as ‘datasets.’ The inability to use Boolean operators complicates affiliation-based queries for institutions like UT Austin whose name contains many generic terms and results in ‘false positives.’ The RADS study used the Crossref public data file, which circumvents this limitation, but this dataset is not incorporated here because the file has practically quadrupled to ~200 GB since its initial release (Crossref 2020, 2025), which may be intractable for an average user to reuse. Similar considerations apply to the DataCite public data file (> 350 GB; DataCite 2024). Frequency of data retrieval will dictate these files’ viability — a one-time snapshot like the RADS study and monthly retrieval, as is intended at UT Austin, come with different considerations (e.g., handling frequent metadata versioning; Hendricks et al. 2020; Strecker 2025). Additional publicly accessible REST APIs for specific data repositories (Dataverse; Dryad; Figshare; Zenodo) and for OpenAlex were also utilized.
Organization
This workflow is Python-based and uses the following standard library modules: datetime; io; json; math; os; shutil; urllib; and xml. Several external modules were also used: numpy (Oliphant 2006); pandas (McKinney 2011); requests (Reitz 2025); and selenium (Goucher et al. 2025). The workflow was developed in Python v3.12.5. The decision to use Python was motivated by existing internal practices and the relative popularity of Python (GitHub 2024).
The codebase comprises multiple processes (Fig. 1) across multiple scripts (Gee 2025a). The primary script includes the DataCite query and secondary targeted processes (e.g., identifying affiliated Figshare deposits without affiliation metadata). There are separate accessory scripts for proof-of-concept (e.g., a ROR-based query for comparison with a non-ROR query), data visualization, and infrequent processes (e.g., Crossref retrieval).

Figure 1: Schematic overview of the dataset discovery workflow. ‘{institution}’ represents a list of permutations of the institutional name. ‘schol. pub’ (scholarly publisher) represents a list of Figshare publisher partners who mediate Figshare DOIs through DataCite, and ‘{OpenAlex code for pub.}’ is a list of corresponding OpenAlex codes for those same publishers. No affiliation parameter is included for the Dataverse API query because it specifically queries the TDR installation. Counts represent the initial number of records returned from each query. Note that for the Figshare workflow (Step 5), the query is intentionally affiliation-agnostic, and it is expected that most records are not affiliated with UT Austin, regardless of their existing metadata (or lack thereof).
Primary search workflow
Lack of standardization in affiliation metadata is the main limitation of affiliation-based queries. ROR identifiers remain relatively unadopted (the Texas Data Repository [TDR], UT Austin’s institutional data repository, only added this functionality in early 2025) and are not viable for this work (e.g., Johnston et al. 2024). For UT Austin, a ROR-based DataCite query retrieved only slightly more than 1,000 results (Table 1; Gee 2025b), skewed towards Dryad (an early adopter; Gould and Lowenberg 2019) and lacking any TDR deposits (n > 1,400); more than a third of the retrieved records (36.6%) are functional duplicates (multiple DOIs for one deposit). A DataCite query using the official institutional name (‘The University of Texas at Austin’) returned about 1,500 results (Table 2; Gee 2025b), a similar proportion of which are functional duplicates (35.2%), and only 129 TDR datasets.
Table 1: Listing of retrieved dataset counts using a ROR-based query for affiliated research datasets in the DataCite API. Nineteen repositories are represented. Cleaning steps are the same as those of the primary workflow that is described later in this paper. Repositories with fewer than five entries in the initial retrieval are grouped together. Data as of November 21, 2025.
| Repository | Initial count | Post-cleaning count |
| --- | --- | --- |
| Dryad | 420 | 420 |
| Zenodo | 348 | 149 |
| Figshare | 199 | 26 |
| Mendeley Data | 27 | 13 |
| NOAA NCEI | 13 | 13 |
| Harvard Dataverse | 9 | 9 |
| International Federation of Digital Seismograph Networks | 9 | 9 |
| NERC EDS UK Polar Data Centre | 8 | 8 |
| Science Data Bank | 8 | 8 |
| Other repositories | 12 | 12 |
| TOTAL | 1,053 | 667 |
Table 2: Listing of retrieved dataset counts using a single-affiliation-based query (‘The University of Texas at Austin’) for affiliated research datasets in the DataCite API. The affiliation was queried only in the creators.affiliation field. Thirty-six repositories are represented. Figshare and Figshare+ are grouped together here. Cleaning steps are the same as those of the primary workflow that is described later in this paper. Repositories with fewer than five entries in the initial retrieval are grouped together. Data as of November 21, 2025.
| Repository | Initial count | Post-cleaning count |
| --- | --- | --- |
| Zenodo | 530 | 227 |
| Dryad | 420 | 420 |
| Figshare | 211 | 33 |
| Texas Data Repository | 129 | 129 |
| Harvard Dataverse | 55 | 55 |
| ICPSR | 22 | 9 |
| IEEE DataPort | 16 | 2 |
| Earth System Grid Federation | 13 | 13 |
| Digital Porous Media Portal | 10 | 10 |
| MassIVE | 7 | 7 |
| PhysioNet | 6 | 2 |
| NSF Arctic Data Center | 6 | 6 |
| Other repositories | 32 | 31 |
| TOTAL | 1,457 | 944 |
To identify attributes that contributed to incomplete retrieval, cross-validation was performed by querying the DataCite API for affiliated deposits in a specific repository (publisher field), making an equivalent query to that repository’s own API, and then cross-referencing the outputs. This was performed for Dryad, TDR (substituted for Harvard Dataverse), and Zenodo. The intent was to perform this process for all of the GREI repositories (except Vivli, which is for clinical data). However, Figshare and Open Science Framework (OSF) only record affiliation metadata for institutional members (with some exceptions for Figshare). Mendeley Data records affiliation metadata, but its API requires a request for access; because it is unclear how such requests are evaluated, and thus whether the API would be accessible to others, it was excluded.
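The cross-referencing at the heart of this step reduces to a set comparison of normalized DOIs. A minimal sketch, assuming the two input lists come from equivalent affiliation-based queries to the DataCite API (filtered by publisher) and to a repository’s own API:

```python
def normalize_doi(doi):
    """Lowercase and strip any resolver prefix so DOIs compare cleanly."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def cross_validate(datacite_dois, repository_dois):
    """Return DOIs found by the repository's own API but missed by
    DataCite, and (rarely) the inverse."""
    datacite = {normalize_doi(d) for d in datacite_dois}
    repository = {normalize_doi(d) for d in repository_dois}
    return sorted(repository - datacite), sorted(datacite - repository)
```

Normalization matters because different APIs return DOIs in different shapes (bare strings, `doi:` prefixes, or full resolver URLs).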
Cross-validation identified hundreds of affiliated datasets that were retrieved by repositories’ APIs but not by the DataCite API (rarely, the inverse was identified; Appendix C; Gee 2025c). Examination of random datasets’ DataCite records identified a range of discrepancies including:
(1) variation in non-ROR-standardized affiliations (e.g., ‘University of Texas, Austin’ is a permutation of UT Austin; Fig. 2);
(2) affiliation metadata that are more granular than the university (e.g., Durkan and Warburton 2023 includes the department) or multi-institutional (e.g., Lichtenberg et al. 2017); and
(3) affiliation metadata recorded in unexpected fields; nearly 100 records listed the institution as a creator or contributor entity rather than as an affiliation (e.g., creators.name rather than creators.affiliation; Fig. 3).

Figure 2: Comparison of the frequency of different permutations of ‘The University of Texas at Austin’ among affiliated datasets retrieved through the DataCite API. Each permutation shown here occurs in at least one dataset retrieved through the affiliation-based DataCite query or the cross-validation process (n = 3,138); some permutations not shown here were included in the search query but were not detected. In some instances, the permutation shown is part of a more granular affiliation and was only detected through the cross-validation process; only the institutional part is shown here. Data as of November 21, 2025.

Figure 3: Comparison of the frequency of DataCite fields in which a permutation of UT Austin was detected. Each dataset is classified into a bin and only counted once. The classification scheme uses the hierarchy of ‘creator.affiliationName,’ ‘contributor.affiliationName,’ ‘creator.name,’ and ‘contributor.name.’ If an affiliation is detected in multiple fields for one dataset (e.g., creator.affiliationName and contributor.affiliationName), it is categorized as whichever field comes first in the hierarchy (e.g., creator.affiliationName). The hierarchy is based on the relative intuitiveness of searching a given field. The data depicted represent all datasets retrieved through the affiliation-based DataCite query and the cross-validation process (n = 3,138). Data as of November 21, 2025.
These attributes are not surprising, but they have not usually been accounted for in previous discovery workflows. Non-standardized affiliations explain the lack of discoverability of TDR datasets, which were crosswalked with affiliations in parentheses: ‘(University of Texas at Austin).’ The exact-string-match nature of the DataCite API thus precluded their discovery. Johnston et al. (2024) utilized the rdatacite R package (Chamberlain and Kramer 2023), which permits non-exact searching, but it is prone to ‘false positives’ for institutions with common terms in the name. For example, the RADS study’s dataset (Hofelich Mohr and Narlock 2024) includes datasets authored by a researcher at ‘West Virginia Institute of Technology’ that are labeled as affiliated with ‘Virginia Tech’ (e.g., Barrett et al. 2022).
To handle affiliation variability, this workflow queries 35 permutations of UT Austin across four fields: creators.affiliation; contributors.affiliation; creators.name; contributors.name. Deposits with granular affiliation metadata (e.g., ‘Department of Government, University of Texas at Austin’) cannot be efficiently retrieved, even with a multi-permutation search, due to their specificity and collective count (there are limits on the number of affiliations in one DataCite query), so deposits with these kinds of metadata that were identified through cross-validation were individually re-queried through the DataCite API using their DOIs.
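The individual re-query of cross-validation hits can be sketched as below. The per-DOI endpoint follows the public DataCite REST API; the helper for spotting granular affiliations (and its tolerance of string vs. object affiliation entries) is an illustrative assumption rather than the workflow’s exact implementation:

```python
import requests


def requery_by_doi(doi):
    """Fetch the full DataCite record for a single DOI identified
    through cross-validation (e.g., one carrying a granular
    affiliation such as a department)."""
    resp = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=60)
    resp.raise_for_status()
    return resp.json()["data"]["attributes"]


def has_known_permutation(attributes, permutations):
    """Check whether any creator affiliation contains a known
    institutional permutation; affiliation entries may be plain
    strings or objects with a 'name' key."""
    for creator in attributes.get("creators", []):
        for aff in creator.get("affiliation", []):
            name = aff.get("name", "") if isinstance(aff, dict) else str(aff)
            if any(p.lower() in name.lower() for p in permutations):
                return True
    return False
```

This substring check is what allows ‘Department of Government, University of Texas at Austin’ to match even though the exact-string-match API query would miss it.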
What about code?
This workflow can retrieve software (and any combination of resource types), but this paper focuses on datasets for several reasons. Firstly, cultural and infrastructural emphasis (e.g., federal funding policies) has centered on data sharing, so rates of code sharing lag behind (e.g., Culina et al. 2020; Hamilton et al. 2022, 2023; Kambouris et al. 2024; Maitner et al. 2024; Sharma et al. 2024). Secondly, fewer researchers will utilize or generate code than will utilize or generate data. Finally, most research software lacks DOIs. UT Austin-affiliated ‘software’ publications number only 436 deposits with DataCite DOIs across just seven repositories, almost all of which are in Zenodo (Table 3; Gee 2025b); by comparison, separate work (Gee and Shensky 2025) has identified over 1,600 purportedly affiliated GitHub accounts with over 35,000 collective repositories.
Table 3: Summary of affiliated software publications from the general affiliation-based DataCite query. *The entry that was reclassified as being from the Department of Energy’s CODE repository originally listed the publisher as ‘University of Texas at Austin,’ the affiliation of the second author; this was edited after manual inspection and is not counted in the total initial count. A single Figshare deposit labeled as ‘software’ was identified from the secondary Figshare workflow. Data as of November 21, 2025.
| Repository | Initial count | Post-cleaning count |
| --- | --- | --- |
| Zenodo | 1,535 | 406 |
| Code Ocean | 28 | 21 |
| brainlife.io | 3 | 3 |
| Sandia National Laboratory | 3 | 3 |
| Department of Energy CODE | 0* | 1 |
| Pacific Northwest National Laboratory | 1 | 1 |
| TOTAL | 1,571 | 435 |
Secondary search workflows
The Crossref API query is externalized as an accessory script due to the low volume of relevant results. Beyond datasets that are discoverable through affiliation-based queries to DataCite or Crossref, there are two other categories of datasets: (1) deposits that have DOIs but lack affiliation metadata; and (2) deposits that either use a PID system that is not centrally searchable (e.g., ARKs, handles) or have no PID but contain affiliation metadata (e.g., GitHub). The subsections below describe secondary workflows that can (partially) discover these deposits.
Crossref API
This accessory script queries ‘datasets’ affiliated with ‘university+of+texas+austin.’ Just 0.1% of the more than 660,000 retrieved DOIs are truly affiliated (n = 554; Table 4; Gee 2025b); the rest are ‘false positives’ that are predominantly affiliated with another UT campus or with an organization with ‘Austin’ in the name (e.g., Stephen F. Austin State University). As with Johnston et al. (2024), affiliated H1 Connect (formerly Faculty Opinions LTD; ~83%) deposits are peer reviews improperly labeled as ‘datasets’ (‘peer review’ has been a resource type since 2018; Lin 2018). The ENCODE Data Coordination Center is also well-represented (~15%). All affiliated ENCODE datasets were deposited by a single researcher, but, as with Johnston et al. (2024), this study was unable to identify a reliable means of consolidating them. The remaining seven platforms had fewer than 10 deposits each. Those that were examined and found not to be datasets (American College of Radiology [case scripts]; Wiley [Authorea preprints]; Society of Exploration Geophysicists [podcast]; and NumFOCUS — Insight Software Consortium (ITK) [poster]) were removed.
Table 4: List of all repositories with 50 or more purportedly UT Austin-affiliated datasets retrieved through Crossref. Forty-eight platforms were included in the retrieval (some are name duplicates); entries recorded as ‘H1 Connects’ and ‘Faculty Opinions Ltd’ are combined, and entries recorded as ‘USDA Forest Service’ and ‘Forest Service Research Data Archive’ are combined. ‘BCO-DMO’ refers to the Biological and Chemical Oceanography Data Management Office; 'EMBL-EBI' refers to the European Molecular Biology Laboratory - European Bioinformatics Institute. Entries listed as ‘Wiley’ are Authorea preprints based on random spot-checking. Data as of November 21, 2025.
| Repository | Repository type | Initial count | Post-cleaning count |
| --- | --- | --- | --- |
| ENCODE Data Coordination Center | Specialist | 503,426 | 82 |
| H1 Connect | Not data repository | 155,419 | 0 |
| Wiley | Not data repository | 1,865 | 0 |
| EMBL-EBI | Specialist | 1,041 | 0 |
| BCO-DMO | Specialist | 902 | 2 |
| USDA Forest Service | Specialist | 341 | 1 |
| Jyvaskyla University Library | Institutional repository | 245 | 0 |
| American College of Radiology | Not data repository | 172 | 0 |
| CABI Publishing | Not data repository | 129 | 0 |
| Boise State University | Institutional repository | 92 | 1 |
| The Hong Kong University of Science and Technology Library | Institutional repository | 82 | 0 |
| All other repositories | Mixed | 263 | 0 |
| TOTAL | | 663,977 | 86 |
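A sketch of the accessory Crossref query and a post-retrieval screen for ‘false positives.’ The endpoint and the `query.affiliation`/`filter` parameters follow the public Crossref REST API; the required/excluded term lists in the screening helper are illustrative assumptions, not the workflow’s exact rules:

```python
import requests


def crossref_dataset_page(affiliation_terms, cursor="*", rows=1000):
    """Fetch one page of Crossref works typed as 'dataset' that match
    the affiliation terms, using cursor-based deep paging."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={
            "query.affiliation": affiliation_terms,
            "filter": "type:dataset",
            "rows": rows,
            "cursor": cursor,
        },
        timeout=60,
    )
    resp.raise_for_status()
    message = resp.json()["message"]
    return message["items"], message.get("next-cursor")


def is_true_positive(affiliation,
                     required=("university of texas", "austin"),
                     excluded=("stephen f. austin", "dallas", "arlington",
                               "san antonio", "el paso")):
    """Heuristic post-filter: query.affiliation cannot take Boolean
    operators, so retrieved affiliations are screened afterwards.
    The term lists here are illustrative, not exhaustive."""
    a = affiliation.lower()
    return (all(term in a for term in required)
            and not any(term in a for term in excluded))
```

The post-filter is where the bulk of the 660,000+ retrieved records would be discarded for an institution whose name contains generic terms.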
Affiliated datasets with DOIs but without affiliation metadata: the Figshare case study
Figshare typically only records affiliation metadata for institutional members, but there are exceptions. UT Austin and five of the six RADS institutions are not members, but this workflow and Johnston et al. (2024) recovered affiliated Figshare deposits. Inspection of these datasets revealed that virtually all datasets listing ‘figshare’ as the publisher are supplemental to Springer Nature articles (Table 5; Gee 2025b). Digital Science, which develops Figshare, and Springer Nature are both part of the Holtzbrinck Publishing Group, and Springer Nature employs a workflow in which files uploaded as ‘supplemental information’ in manuscript portals are automatically published on Figshare upon article publication. Other partners using this workflow include Frontiers, PLOS, and Taylor & Francis (Figshare, n.d.), but Figshare deposits mediated through these publishers are indexed differently, either because the DOI is minted through Crossref (e.g., PLOS) or because the DataCite publisher field lists the scholarly publisher (e.g., a deposit mediated through a Taylor & Francis journal will list ‘Taylor & Francis’ as the publisher); Springer Nature is the only one to list the publisher as ‘figshare.’ This indexing thus explains the handful of recovered datasets where the publisher is a scholarly publisher. To better understand this process, metadata were examined from a random subset of recent deposits for each publisher partner (Gee 2025b). Identification of certain trends (e.g., listed publisher; whether a set of files receives one DOI, or each file receives a DOI) then facilitated development of workflows to identify affiliated Figshare deposits without affiliation metadata.
Table 5: Comparison of journals associated with RADS Figshare deposits. The RADS dataset (Hofelich Mohr and Narlock, 2024) was filtered for objects labeled as ‘dataset’ and with ‘figshare’ listed as the publisher (omitting any deposits mediated through a non-Springer Nature partner; n = 2,276) and then cleaned in the same fashion as in the workflow of this paper, resulting in 1,040 DOIs after removal of versions. Among these 1,040 DOIs, there are 268 unique article DOIs listed as being supplemented by a Figshare deposit, across 86 unique journal titles (data in Gee 2025b).
| Journal | Publisher | Count |
| --- | --- | --- |
| BMC Plant Biology | Springer Nature | 27 |
| BMC Genomics | Springer Nature | 22 |
| Genome Biology | Springer Nature | 14 |
| BMC Biology | Springer Nature | 11 |
| BMC Bioinformatics | Springer Nature | 10 |
| BMC Cancer | Springer Nature | 7 |
| Genome Medicine | Springer Nature | 7 |
| Molecular Cancer | Springer Nature | 7 |
| Journal of Experimental & Clinical Cancer Research | Springer Nature | 6 |
| Journal of Neuroinflammation | Springer Nature | 6 |
| Journal of Translational Medicine | Springer Nature | 6 |
| Microbiome | Springer Nature | 6 |
| BMC Medicine | Springer Nature | 5 |
| Cell & Bioscience | Springer Nature | 5 |
| Journal of Hematology & Oncology | Springer Nature | 5 |
| Stem Cell Research & Therapy | Springer Nature | 5 |
| All other Springer Nature titles (n = 68) | Springer Nature | 118 |
| CABI Agriculture and Bioscience | CABI Publishing | 1 |
| TOTAL | | 268 |
Non-Springer Nature partners with mediated DataCite DOIs do not maintain data repositories, so any DataCite dataset that lists a publisher partner as the publisher is inferred to be a mediated Figshare deposit. These deposits’ metadata usually include the relationship to an article (relationType: ‘IsSupplementTo’) and that article’s DOI. Using these attributes, this workflow (Fig. 4) queries DataCite for ‘datasets’ where the publisher is a partner known to mediate deposits through DataCite (e.g., Taylor & Francis). Next, it queries the OpenAlex API for affiliated journal articles from those publishers. Cross-referencing these dataframes for shared article DOIs then links mediated Figshare deposits without affiliation metadata to UT Austin-affiliated articles (Fig. 4). The number of datasets gained through this process is not large in absolute terms (< 150 datasets; Fig. 5), but it represents a nearly five-fold increase from the affiliation-based DataCite query (32 vs. 125).
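The cross-referencing at the end of this process amounts to an inner join on article DOIs. A minimal pandas sketch, in which the column names and all DOIs are hypothetical placeholders:

```python
import pandas as pd

# Mediated Figshare 'datasets' retrieved from DataCite, each carrying
# the DOI of the article it supplements (relationType 'IsSupplementTo').
figshare = pd.DataFrame({
    "dataset_doi": ["10.6084/m9.figshare.111", "10.6084/m9.figshare.222"],
    "article_doi": ["10.1080/example.1", "10.1080/example.2"],
})

# UT Austin-affiliated journal articles retrieved from the OpenAlex API
# (filtered by publisher via locations.source.host_organization).
articles = pd.DataFrame({
    "article_doi": ["10.1080/example.1"],
})

# An inner merge keeps only deposits whose supplemented article is
# institutionally affiliated, linking deposits that themselves lack
# affiliation metadata to the institution.
affiliated = figshare.merge(articles, on="article_doi", how="inner")
```

An inner join is the right shape here because a mediated deposit without a matching affiliated article tells us nothing about its institution and is simply dropped.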

Figure 4: Schematic diagram showing the secondary workflow to discover affiliated Figshare deposits that were mediated through a publisher partner and that usually lack affiliation metadata. The code listed in the locations.source.host_organization field for the OpenAlex API is the OpenAlex code for Taylor & Francis. In the full workflow, a list of publisher partners’ names and OpenAlex codes is looped in a single API call.

Figure 5: Comparison of the number of additional Figshare DOIs without affiliation metadata that were identified through connections with affiliated articles. Counts represent post-deduplication counts (n = 125). Data as of November 21, 2025.
Figshare deposits with Crossref DOIs are significantly harder to retrieve due to lower-quality metadata (e.g., no metadata linkage to article) and to over-splitting where each supplemental file for a single article receives its own Figshare DOI. The volume created by over-splitting is a particular impediment; there are over 4,000,000 PLOS-mediated ‘components’ (the Crossref label for their SI-to-Figshare deposits). For UT Austin alone, the 640 affiliated PLOS articles with at least one SI file collectively include 3,883 SI files with individual DOIs (the most SI DOIs for one article is 38; Gee 2025b). Three PLOS-specific workflows have been tested (Appendix D; Gee 2025c), but a publisher-agnostic, long-term workflow remains to be developed.
Affiliated deposits without DOIs: the NCBI case study
Datasets are often disseminated without a DOI, including through deposition in data repositories that do not issue DOIs. For example, NCBI repositories (e.g., GenBank) issue only accession and project numbers. However, such repositories may record affiliation metadata in an internally standardized fashion that permits affiliation-based queries within the platform.
To this end, NCBI was targeted because of the centralization of bioinformatics data. It will be a significant source of data for any institution with appreciable NIH (National Institutes of Health) funding. Affiliation metadata are searchable in the web interface, and records can be programmatically accessed through the Entrez system through a command-line tool (E-Utilities) or a third-party library (e.g., biopython, eutils; Cock et al. 2009; Stevenson et al. 2019). This section describes two approaches to retrieve the nearly 1,000 UT Austin-affiliated BioProjects (closest approximation of a ‘dataset’; Gee 2025b).
The first approach queries Entrez with biopython (query: university+of+texas+austin), which is straightforward but reliant on maintenance of the module and on adherence to the NCBI rate limit (National Center for Biotechnology Information 2025). The second approach uses Selenium WebDriver, a web automation tool that scripts actions within a browser. This approach simulates a user manually searching and downloading the results file from the web interface and is faster than biopython, but it relies on HTML stability and, likewise, on adherence to rate limiting.
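The first approach can also be sketched with only the standard library, calling E-Utilities directly rather than through biopython. The esearch endpoint and parameters follow NCBI’s documented API; the query string mirrors the one used in the text:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=1000):
    """Construct an Entrez esearch URL for the BioProject database."""
    params = {"db": "bioproject", "term": term,
              "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def search_bioprojects(term, retmax=1000):
    """Return matching BioProject UIDs. This makes a live request and
    is subject to the NCBI rate limit (3 requests/second without an
    API key)."""
    with urlopen(build_esearch_url(term, retmax)) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]
```

The returned UIDs would then be passed to esummary/efetch in batches to retrieve full BioProject metadata.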
Cleaning steps
A significant component of this workflow is the cleaning and standardization of retrieved deposits to handle inter-repository variation like granularity in PID assignment. For example, Dataverse-based repositories can mint a DOI for each file but maintain the DOI (file or dataset) across all versions, whereas Zenodo only mints DOIs for datasets but issues a new DOI for each version. This workflow removes Dataverse file-level DOIs (as with Johnston et al. 2024), which have an alphanumeric suffix appended to the dataset DOI. This step removes most initial records: of the >72,000 affiliated TDR DOIs, only about 1,500 are dataset-level records. The workflow also removes Dryad file-level DOIs (addition of ‘/*’ to the dataset DOI, where ‘*’ is an integer; this is no longer a current practice).
Zenodo mints a ‘parent’ DOI that resolves to the most recent version and a ‘child’ DOI for each version. This workflow handles multi-DOI Zenodo deposits (as with Johnston et al. 2024) using deduplication on multiple fields to retain only the ‘parent.’ Figshare, the Inter-university Consortium for Political and Social Research (ICPSR), and Mendeley Data use the ‘parent-child’ system but construct ‘child’ DOIs differently (they append ‘.v*’, where ‘*’ is an integer, to the ‘parent’ DOI). This workflow departs from Johnston et al. (2024) in retaining only the ‘parent’ DOI for these repositories. It further differs in handling instances when files for a single manuscript were each given a mediated Figshare DOI (over-splitting); all Figshare datasets supplementing the same relatedIdentifier are consolidated into one entry. For comparison, if these cleaning steps were applied to the RADS dataset (Hofelich Mohr and Narlock 2024), the resultant Figshare count for most institutions would be less than 15% of the reported count (Fig. 6; Gee 2025b).

Figure 6: Comparison of reported dataset counts for RADS institutions from Johnston et al. (2024) versus counts after removal of redundant version DOIs and consolidation of Figshare dataset deposits. The proportion is scaled against the total number of Figshare DOIs for each institution (when a dataset lists multiple RADS institutions, it is counted for each one). ‘Versions removed’ drops any DOI that ends in ‘.v*’ (i.e. retaining only the ‘parent’). This proportion should be no more than 50% but could be lower (e.g., Duke) if there are datasets with more than one version. ‘Consolidation’ deduplicates entries that share an affiliated institution, publicationDate, and the entire relatedIdentifier field; this represents the number of unique articles supplemented by these Figshare deposits, which accounts for over-splitting of some mediated Figshare deposits (one DOI for each file associated with the same manuscript). Institutions with relatively low post-consolidation proportions have a relatively high file:article ratio (i.e. are more over-split). The summary table with exact counts is included in the dataset (Gee 2025b).
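The version-removal step (retaining only the ‘parent’ DOI for the ‘.v*’-style repositories) can be sketched as below; the Zenodo case instead requires field-based deduplication because its parent and child DOIs share no common suffix. This is a simplified sketch, not the published code:

```python
import re

_VERSION = re.compile(r"\.v\d+$")


def drop_version_dois(dois):
    """Drop '.vN' child DOIs whose parent DOI also appears in the corpus
    (Figshare, ICPSR, and Mendeley Data style)."""
    corpus = set(dois)
    return [d for d in dois
            if not (_VERSION.search(d) and _VERSION.sub("", d) in corpus)]
```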
Repository names were also cleaned to handle permutations of the same platform (e.g., ‘Dryad Digital Repository’ vs. ‘Dryad’) and instances where the repository is not listed as the publisher (e.g., most mediated Figshare deposits; platforms where depositors can edit this field, like Zenodo). Additional cleaning of smaller-volume repositories and of individual datasets was necessary (Appendix E; Gee 2025c), but specifics will vary by institution.
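Name normalization reduces to a lookup table built from observed publisher strings. The entries below are examples drawn from the text; a production table would be institution-specific:

```python
# Example alias table; a real table is built from observed publisher strings.
REPOSITORY_ALIASES = {
    "Dryad Digital Repository": "Dryad",
    "Taylor & Francis": "Figshare",  # mediated Figshare deposits list the partner publisher
}


def normalize_repository(publisher):
    """Map a publisher string to a canonical repository name."""
    return REPOSITORY_ALIASES.get(publisher.strip(), publisher.strip())
```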
UT Austin snapshot
As of November 21, 2025, the post-cleaning (final) corpus includes over 4,300 UT Austin-affiliated datasets across 73 repositories (Table 6; Gee 2025b). As with Johnston et al. (2024), the institutional repository (TDR) and certain generalist repositories are the best-represented (Table 7; with OSF conspicuously absent). Some specialist repositories also have appreciable counts, including DesignSafe, a natural hazards repository that is maintained at the Texas Advanced Computing Center (TACC), a subsidiary of UT Austin; ICPSR (a large social science repository); MassIVE, a mass spectrometry repository at the University of California, San Diego; and the Environmental Molecular Sciences Laboratory (EMSL), a biological and environmental sciences research facility.
Table 6: Summary results of the five sources of affiliated research datasets that were utilized in this workflow. ‘DataCite API’ represents the primary affiliation-based query. ‘Specific repositories’ APIs’ represents additional datasets that were only discovered through the cross-validation process. ‘DataCite + OpenAlex APIs’ represents additional Figshare datasets that were identified through the secondary Figshare workflow; its initial count is intentionally affiliation-agnostic (searching for all datasets listed for a scholarly publisher), so most of these deposits are naturally not institutionally affiliated. ‘NCBI’ represents BioProject records. ‘Crossref API’ represents the affiliation-based query in this platform. The initial count represents the initial corpus of records, and the post-cleaning count represents the corpus after cleaning and deduplication. For ‘Specific repositories’ APIs’ and ‘DataCite + OpenAlex APIs,’ the post-cleaning count includes only records that were retained after combining with the general ‘DataCite API’ output and removing duplicates (retaining those retrieved from the general API query); in other words, the post-cleaning counts reflect the records that could only be identified through these processes. Data as of November 21, 2025.
| Source | Initial count | Post-cleaning count |
| --- | --- | --- |
| DataCite API | 77,078 | 3,116 |
| Specific repositories’ APIs | 3,582 | 21 |
| DataCite + OpenAlex APIs | 163,253 | 125 |
| NCBI | 996 | 953 |
| Crossref API | 663,977 | 86 |
| TOTAL | 908,886 | 4,301 |
Table 7: List of all repositories with 40 or more UT Austin-affiliated datasets. The numerical threshold is applied to the post-cleaning counts. Initial counts reflect only records retrieved in the primary affiliation-based DataCite query (except for NCBI, ENCODE [Crossref], and Figshare [secondary]), whereas the post-cleaning count includes records from all sources of data. For Figshare, the numbers of deposits recovered from the affiliation-based query and the secondary Figshare workflow are separated. The initial Figshare (primary) count includes mediated Figshare deposits that list a publisher known to be a Figshare partner (e.g., Taylor & Francis). The initial Figshare (secondary) count is intentionally affiliation-agnostic (searching for all datasets listed for a scholarly publisher), so most of these deposits are naturally not institutionally affiliated. Figshare and Figshare+ are combined here. For Dryad, the initial count is lower than the post-cleaning count because one dataset was not retrieved from DataCite’s API, its UT Austin metadata being present only in the Dryad API (Appendix C; Gee 2025c). ‘Env. Mol. Sci. Lab.’ is the Environmental Molecular Sciences Laboratory. Data as of November 21, 2025.
| Repository | Repository type | Initial count | Post-cleaning count |
| --- | --- | --- | --- |
| Texas Data Repository | Institutional | 72,070 | 1,487 |
| NCBI | Domain | 996 | 953 |
| Dryad | Generalist | 423 | 424 |
| Zenodo | Generalist | 828 | 390 |
| Harvard Dataverse | Generalist/Institutional | 1,185 | 311 |
| Figshare (secondary) | Generalist | 163,253 | 125 |
| ICPSR | Domain | 249 | 93 |
| ENCODE | Domain | 503,426 | 82 |
| MassIVE | Domain | 81 | 81 |
| DesignSafe | Domain | 199 | 56 |
| Env. Mol. Sci. Lab | Domain | 1,469 | 42 |
| Figshare (primary) | Generalist | 193 | 32 |
| All other repositories | Mixed | 160,504 | 226 |
| TOTAL | 742,176 | 4,301 |
Metadata assessments
One of the intended uses of discovered datasets is examining topics such as use of UT Austin’s institutional data repository (TDR); frequency of large datasets (e.g., over 50 GB); file format prevalence; and frequency of software being published in or as ‘datasets,’ and, in turn, using this information for research data service development. This section describes two existing metadata assessments (see also Appendix F; Gee 2025c). Because DataCite and Crossref have few required fields, many assessments can only be performed for a subset of datasets.
Object classification
Accuracy of metadata labels is a major challenge for DOI-backed deposits; some ‘datasets’ are not truly datasets (e.g., H1 Connect peer reviews), and non-‘dataset’ objects can contain data (e.g., the ‘components’ for PLOS SI). Assessing deposits’ ‘true nature’ with other metadata fields is not always reliable (or possible). For example, file format can be misleading because researchers sometimes publish data in suboptimal formats (e.g., tabular data in PDFs), and relative optimality of formats varies by discipline and data type (e.g., PDFs are optimal for qualitative data like transcripts). Filename information can be confounded by variable usage of terms like ‘appendix’ and ‘supplemental information.’ An overarching challenge is interdisciplinary variation in concepts of ‘data’ (e.g., Guy et al. 2013; Gualandi et al. 2022). Nonetheless, certain rules can be implemented (e.g., identifying unambiguous “not data” formats like software).
For UT Austin, 287 ‘datasets’ include a software format and a non-software format; 23 datasets exclusively comprise software (Fig. 7; Gee 2025b). These collectively represent ~21% of the deposits with file-level metadata in DataCite’s API, demonstrating the frequency of mixed-media deposits and the resultant challenges of accurate programmatic retrieval based on object type. Given researchers’ lack of familiarity with metadata schemas and tendency to publish mixed-media deposits, developing alternative means of characterizing research outputs is essential (e.g., text-mining articles; Howison and Bullard 2016; Zhao et al. 2018; Du et al. 2021; Istrate et al. 2022, 2024; Schindler et al. 2022; Pan et al. 2023; Druskat et al. 2024).

Figure 7: Count of affiliated datasets based on whether one or more software formats was detected in a dataset’s metadata. The data depicted represent all ‘datasets’ retrieved in the affiliation-based DataCite query and the cross-validation with specific repositories’ APIs (n = 3,138). Red bars represent datasets with at least one software format; the blue bar represents datasets that do not contain a (readily identifiable) software format. Non-code files are not necessarily ‘data.’ As shown through the gray bar, many datasets do not contain any information on file formats. Data as of November 21, 2025.
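The ‘unambiguous formats’ rule can be sketched as a simple set test. This sketch assumes file formats are recorded as extensions; real DataCite records may instead hold MIME types or free text, so an actual implementation needs additional mapping, and the extension list here is illustrative:

```python
# Illustrative software extensions; the actual list used is not reproduced here.
SOFTWARE_EXTENSIONS = {".py", ".r", ".m", ".jl", ".cpp", ".sh", ".ipynb"}


def classify_formats(formats):
    """Coarsely classify a deposit from its file-format metadata."""
    if not formats:
        return "no format metadata"
    exts = {f.lower() for f in formats}
    if exts <= SOFTWARE_EXTENSIONS:      # every format is a software format
        return "software only"
    if exts & SOFTWARE_EXTENSIONS:       # some, but not all, are software
        return "mixed software and other formats"
    return "no software detected"
```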
Authorship contribution
Being listed as a dataset author does not necessarily imply any role in dataset publication. Given variability in determination of authorship (e.g., Holcombe 2019), a researcher’s role in dataset publication is often unclear. However, inferring this role is important for data interpretation; for example, higher publishing volume in one repository versus another could suggest relative preference by an institution’s researchers, but differences could alternatively result from external collaborators’ preferences (among other explanations). Here, the first and last authors’ affiliations are used as one coarse estimator of meaningful contributions by an affiliated researcher (Fig. 8).

Figure 8: Comparison of authorship position of UT Austin authors on affiliated datasets. Orange demarcates any category in which a UT Austin researcher is lead and/or senior author; blue demarcates when a UT Austin researcher is neither lead nor senior; and gray (only NCBI) results from the different metadata schema in which specific authorship is less clear. The data represent all datasets across all sources (n = 4,301). Data as of November 21, 2025.
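The first/last-author estimator can be sketched as below. The creator and affiliation field names are assumptions loosely modeled on DataCite records, not the published code:

```python
def lead_or_senior(creators, institution="university of texas at austin"):
    """Return True if the first or last listed creator has an affiliation
    containing the institution's name (a coarse contribution estimator)."""
    def affiliated(creator):
        return any(institution in a.lower()
                   for a in creator.get("affiliations", []))
    return bool(creators) and (affiliated(creators[0]) or affiliated(creators[-1]))
```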
Comparison of repositories through this lens reveals some interesting differences (Fig. 9). Dryad deposits are less likely to have a UT Austin author in the lead and/or senior position than TDR or Harvard Dataverse deposits. Because Dryad charges for data publication, UT Austin researchers are probably less likely to choose Dryad unless fees are covered by a partner journal or a coauthor from a member institution. However, Zenodo deposits (no fees) also show a lower frequency of a lead and/or senior UT Austin researcher, so cost is not the only explanation. The most interesting result is the high proportion of lead and/or senior authors for Harvard Dataverse deposits (contextualized below).

Figure 9: Comparison of the frequency of datasets with a UT Austin researcher listed in the lead or senior author positions among select repositories. The selected repositories are based on those with the highest number of UT Austin-affiliated datasets (excluding NCBI). Orange includes any category in which a UT Austin researcher is lead and/or senior author; blue demarcates when a UT Austin researcher is neither lead nor senior. The data represent all datasets across all sources (n = 4,301). Data as of November 21, 2025.
Discussion
UT Austin overview
General patterns for UT Austin (majority of deposits in the institutional data repository [TDR] and large generalists) are similar to the RADS institutions (Johnston et al. 2024). These high-level data can be further interrogated to identify trends among researchers (e.g., comparison of annual publication volume; Fig. 10). Sustained growth in Harvard Dataverse usage is particularly interesting in tandem with the author contribution inference (Fig. 9), since TDR is also built on Dataverse software (i.e. there is no obvious “advantage” of Harvard’s repository). TDR has itself seen strong growth (ideally related to institutional promotion), with volatility related to the over-splitting that can occur in Dataverse repositories (materials split into multiple ‘datasets’ in one ‘dataverse’; Appendix E; Gee 2025c). Zenodo usage is also growing quickly; its partnership with Dryad (Lowenberg 2021), growing academic use of GitHub (Färber 2020; Escamilla et al. 2022, 2023; GitHub 2024) and the GitHub-Zenodo integration, participation in GREI, and a large file size limit (50 GB) likely contribute to its popularity. In contrast, Dryad presents a flat pattern; this trend could reflect stable publishing volume in certain partner journals that cover the publishing fee and/or sustained collaboration with researchers at member institutions. The NCBI pattern is uneven and difficult to explore since NCBI uses a different schema; the role of affiliated researchers cannot be inferred in the same way (Fig. 10).

Figure 10: Comparison of the annual volume of UT Austin-affiliated dataset publications among select repositories (2014–2024). The selection is based on the repositories with the highest numbers of UT Austin-affiliated datasets; the figure draws from the full corpus of affiliated datasets and excludes ‘software’ deposits. Data as of November 21, 2025.
FAIR for whom: open data, closed metadata
Community efforts to develop best practices for data sharing often focus on discovery and reuse by other researchers. From this perspective, it is not surprising that repositories neither uniformly nor consistently record affiliation metadata because these metadata are unlikely to be utilized by a researcher looking for data. Nonetheless, similar to how there is broad recognition of the importance of sharing research outputs in a fashion that permits broad reuse beyond just specialists, there should also be community emphasis on ensuring broad findability by entities beyond researchers.
Discovery workflows are intrinsically limited by upstream repository infrastructure and processes (e.g., Wu et al. 2019; Gregory et al. 2020; Löffler et al. 2021), and DOI-backed publication in a repository does not guarantee affiliation-based discovery (e.g., mediated Figshare deposits; various specialist repositories). Even metadata-conscious depositors cannot overcome a lack of metadata support (e.g., a researcher at a non-member institution cannot add affiliations to an OSF deposit). Certain well-known repositories have made public commitments to, and received extensive federal funding for, implementation of metadata standards but do not adhere to these in practice. Van Gulick et al. (2024) stated, “repositories need to collect information about the dataset producer and their affiliation.” That GREI repositories like Figshare and OSF only collect this information for subscribing institutions hardly seems sufficient to achieve this aim. Similarly, Curtin et al. (2023) stated that “we [GREI] also hope this common metadata schema will be useful for data repositories beyond GREI to improve interoperability across data repositories and across the NIH data landscape.” It is difficult to envision a scenario in which other repositories feel compelled to adopt GREI recommendations that are not even adhered to by some GREI repositories.
Institutional repository memberships in which institutional discovery is touted as a benefit operate on the implicit premise that it is impossible to easily or systematically track affiliated outputs without membership, which results in numerous shortcomings for non-members. Neither this workflow nor Johnston et al. (2024) recovered any OSF datasets, for example. There is a certain irony that repositories committed to high-quality, open data do not demonstrate the same commitment to high-quality, open metadata. If datasets are not findable at an institutional scale, without membership-based tracking functionality, I would argue that they are not really FAIR. Efforts to study and improve data sharing practices are not limited to researchers; there is also a need for critical interrogation of repositories’ practices, which severely limit discovery workflows.
Adaptation for use at other institutions
The codebase (Gee 2025a) is designed for easy adoption by another institution, with institution-specific parameters defined in the configuration file. A re-user would simply need to identify expected permutations of their institutional name to be queried and substitute them into the existing dictionary with permutations for UT Austin; a few other parameters (e.g., the abbreviated name to include in filenames) also need to be edited. If a re-user wants to perform the cross-validation process, either with only the Zenodo API or with both Zenodo and a Dataverse-based repository, they will need to obtain a free API key by creating an account with these platforms and enter that into the configuration file. The only significant time investment involved in re-use of this workflow is the need to manually inspect the outputs to discern whether there are edge cases specific to a given institution (e.g., a utilized repository with unexpected metadata formatting that has not been used by UT Austin researchers and is thus not handled presently). Re-use instructions are further provided in the code’s README (Gee 2025a). As detailed below, this workflow is intended to be continually developed to improve coverage and to increase utility.
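As a rough illustration, the institution-specific parameters might look like the following. All key names here are hypothetical; the actual configuration file in the codebase (Gee 2025a) defines its own:

```python
# Hypothetical configuration sketch; see the codebase's README for the real parameters.
CONFIG = {
    "institution_abbreviation": "utaustin",  # abbreviated name used in output filenames
    "name_permutations": [
        "University of Texas at Austin",
        "The University of Texas at Austin",
        "UT Austin",
    ],
    # Free API keys, needed only for the optional cross-validation step:
    "zenodo_api_key": "",
    "dataverse_api_key": "",
}
```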
Future directions
The primary focus for this workflow is increasing coverage through targeted workflows of undetected repositories that are known to be used by affiliated researchers. For example, UT Austin is an institutional member of the Qualitative Data Repository (QDR; Karcher et al. 2016), but this workflow did not retrieve any QDR datasets (Gee 2025b). Investigation of DataCite records for affiliated deposits identified through the web interface (e.g., Madrid 2016) revealed that granular metadata prevent discovery. Another example is MorphoBank (O’Leary and Kaufman 2011; Long-Fox et al. 2024), a phylogenetic repository with known usage by UT Austin researchers (e.g., Parker et al. 2022); a lack of affiliation crosswalking prevents discovery. A final example is MorphoSource (Boyer et al. 2016), a computed tomographic data repository that is used by the University of Texas Computed Tomography (UTCT) lab. Many MorphoSource datasets are not associated with a specific article, including UTCT datasets (nearly 3,500 deposits, most without DOIs) generated through the openVertebrate Thematic Collections Network project (Blackburn et al. 2024), but these outputs are also of interest for understanding institutional outputs.
Another future objective is identifying metadata discrepancies and, where possible, remediating them. The constant evolution of schemas and best practices for metadata means that continual metadata maintenance is necessary to ensure datasets’ FAIRness. Repositories need to look not only forward, to adopt practices for data that are yet to come, but also backward, to maintain existing data. Additionally, because researchers are unlikely to update datasets simply to enhance general metadata (e.g., adding ROR identifiers), metadata maintenance (re-curation; e.g., Habermann 2023, 2024) often becomes a community responsibility for repositories and data stewards (e.g., the Collaborative Metadata Enrichment Taskforce [COMET]; Buttrick et al. 2025). Discrepancies in affiliation metadata that were identified through this workflow have already led to TDR improvements (e.g., formatting of affiliation metadata), with additional curation planned (e.g., programmatically adding ROR identifiers to existing datasets).
Finally, research data scholarship and services are an emerging focus within libraries, and broader internal sharing of dataset publication data with other units can facilitate improved awareness of research activities. At UT Austin, a separate scripted process is used to provide subject liaisons with monthly updates on recently published TDR datasets. As certain disciplines are more likely than others to use TDR, this notification process has been expanded to cover all recently published discoverable datasets in order to provide more comprehensive information for liaisons.
Post-acceptance author notice
One of the core barriers to institutional discovery noted in this study is the variable construction of an institution’s name in dataset metadata, as well as more granular affiliation metadata (e.g., inclusion of department), in tandem with limitations of the DataCite API, which relied on exact string matches or imprecise wildcard searches. At nearly the same time as the acceptance of this manuscript, DataCite released new API functionality (Ross 2025) that permits more flexible search queries (in select metadata fields) that are case-insensitive and that can return all results that contain, but do not exactly match, a set of defined terms. For example, ‘university\ of\ texas\ austin’ will return all results that contain those four words in the specified field(s), regardless of whether they occur in that exact order or whether additional words are included. Early testing of this functionality for UT Austin demonstrated that the more flexible query could retrieve deposits with granular affiliation metadata that were not previously detected (e.g., Madrid [2016] is a dataset in the Qualitative Data Repository [QDR] that lists the affiliation as ‘Department of Government, University of Texas at Austin’). This expanded functionality is not a perfect solution: flexible searches cannot be applied to creator or contributor name fields, for example, and thus cannot handle instances where institutions are listed as individual entities rather than affiliations (a known occurrence), and additional queries are necessary to handle instances where part of an institution’s name is abbreviated (e.g., ‘UT’ for ‘University of Texas’). Nonetheless, this new functionality addresses at least some of the challenges noted in this article (e.g., granular affiliation metadata) and provides increased retrieval capacity.
Early testing in the context of UT Austin suggests that the new flexible query does not identify a large number of previously undetected datasets, but it nonetheless increases the comprehensiveness of affiliated deposit retrieval.
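A flexible query of this kind against DataCite’s REST API could be sketched as below. The field path and parameters here are assumptions for illustration; DataCite’s documentation lists the supported fields, and the HTTP request itself is shown but not executed:

```python
from urllib.parse import urlencode


def flexible_query(words, field="creators.affiliation.name"):
    """Build a contains-style DataCite query: backslash-escaped spaces keep
    the lowercased words together as one flexible term."""
    return f"{field}:" + "\\ ".join(w.lower() for w in words)


params = urlencode({
    "query": flexible_query(["University", "of", "Texas", "Austin"]),
    "resource-type-id": "dataset",
})
url = "https://api.datacite.org/dois?" + params
# An HTTP GET on `url` would return matching records (not executed here).
```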
References
Alawiye, Rhoda, and Danielle Kirsch. 2024. “Using Publicly Available Metadata to Analyze Data Sharing Practices at Oklahoma State University.” Presentation. All Things Open, Virtual, April 16. https://digitalcommons.kennesaw.edu/ato/2024allthingsopen/presentations/12.
Barrett, Craig F., Mathilda V. Santee, Nicole M. Fama, John V. Freudenstein, Sandra J. Simon, and Brandon T. Sinn. 2022. “Lineage and Role in Integrative Taxonomy of a Heterotrophic Orchid Complex.” Dataset. Zenodo. https://doi.org/10.5281/zenodo.5949346.
Barsky, Eugene, John Brosz, and Amber Leahey. 2016. “Research Data Discovery and the Scholarly Ecosystem in Canada: A White Paper.” University of British Columbia. https://doi.org/10.14288/1.0307548.
Benjelloun, Omar, Shiyu Chen, and Natasha Noy. 2020. “Google Dataset Search by the Numbers.” In The Semantic Web – ISWC 2020, edited by Jeff Z. Pan, Valentina Tamma, Claudia d’Amato, et al. Springer Nature. https://doi.org/10.1007/978-3-030-62466-8_41.
Blackburn, David C., Doug M. Boyer, Jaimi A. Gray, et al. 2024. “Increasing the Impact of Vertebrate Scientific Collections through 3D Imaging: The openVertebrate (oVert) Thematic Collections Network.” BioScience 74 (3): 169–186. https://doi.org/10.1093/biosci/biad120.
Borgman, Christine L. 2012. “The Conundrum of Sharing Research Data.” Journal of the American Society for Information Science and Technology 63 (6): 1059–1078. https://doi.org/10.1002/asi.22634.
Boyer, Doug M., Gregg F. Gunnell, Seth Kaufman, and Timothy M. McGeary. 2016. “MorphoSource: Archiving and Sharing 3-D Digital Specimen Data.” The Paleontological Society Papers 22: 157–181. https://doi.org/10.1017/scs.2017.13.
Briney, Kristin. 2023. “Where’s the Data? An Analysis of Links to Shared Data from Articles Published by a Single University.” Presentation. Research Data Access and Preservation (RDAP) Annual Meeting, Virtual. OSF, March 30. https://osf.io/967zw.
Briney, Kristin A. 2024. “Measuring Data Rot: An Analysis of the Continued Availability of Shared Data from a Single University.” PLOS ONE 19 (6): e0304781. https://doi.org/10.1371/journal.pone.0304781.
Buttrick, Adam, John Chodacki, Juan Pablo Alperin, et al. 2025. “The COMET Model: Transitioning to Community-Curated PID Metadata Enrichment.” Report. Zenodo. https://zenodo.org/records/15882315.
Chamberlain, Scott, and Bianca Kramer. 2023. rdatacite: Client for the “DataCite” API. V. 0.5.4. Released February 5. https://doi.org/10.32614/CRAN.package.rdatacite.
Chapman, Adriane, Elena Simperl, Laura Koesten, et al. 2020. “Dataset Search: A Survey.” The VLDB Journal 29 (1): 251–272. https://doi.org/10.1007/s00778-019-00564-x.
Cock, Peter J. A., Tiago Antao, Jeffrey T. Chang, et al. 2009. “Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics.” Bioinformatics 25 (11): 1422–1423. https://doi.org/10.1093/bioinformatics/btp163.
Crosas, Mercè. 2011. “The Dataverse Network®: An Open-Source Application for Sharing, Discovering and Preserving Data.” D-Lib Magazine 17 (1/2). https://doi.org/10.1045/january2011-crosas.
Crossref. 2020. “March 2020 Public Data File from Crossref.” Academic Torrents. http://doi.org/10.13003/83b2gp.
Crossref. 2025. “March 2025 Public Data File from Crossref.” Academic Torrents. https://doi.org/10.13003/87bfgcee6g.
Culina, Antica, Ilona van den Berg, Simon Evans, and Alfredo Sánchez-Tójar. 2020. “Low Availability of Code in Ecology: A Call for Urgent Action.” PLOS Biology 18 (7): e3000763. https://doi.org/10.1371/journal.pbio.3000763.
Curtin, Lisa, Lorenzo Feri, Julian Gautier, et al. 2023. “GREI Metadata and Search Subcommittee Recommendations_V01_2023-06-29.” Report. Zenodo. https://doi.org/10.5281/zenodo.8101956.
DataCite. 2024. “DataCite Public Data File.” DataCite, March 21. https://doi.org/10.14454/ZHAW-TM22.
Dellureficio, Anthony, and Klara Pokrzywa. 2024. “Elevating the Role of Research Data in Scholarly Communication.” Presentation. Research Data Access and Preservation (RDAP) Annual Meeting, Virtual. OSF, March 4. https://osf.io/ter8c.
Druskat, Stephan, Neil P. Chue Hong, Sammie Buzzard, Olexandr Konovalov, and Patrick Kornek. 2024. “Don’t Mention It: An Approach to Assess Challenges to Using Software Mentions for Citation and Discoverability Research.” Preprint, arXiv, February 22. https://doi.org/10.48550/arXiv.2402.14602.
Du, Caifan, Johanna Cohoon, Patrice Lopez, and James Howison. 2021. “Softcite Dataset: A Dataset of Software Mentions in Biomedical and Economic Research Publications.” Journal of the Association for Information Science and Technology 72 (7): 870–884. https://doi.org/10.1002/asi.24454.
Durkan, Leanne, and Niels Warburton. 2023. “Example Calculation of E1[H1] Contribution to the Source for Second-Order Metric Perturbations of a Schwarzschild Black Hole.” Dataset. Zenodo. https://doi.org/10.5281/zenodo.8405797.
Elsevier. 2025. “Sunset of Data Monitor.” Pure Help Center for Pure Administrators, April 21. https://web.archive.org/web/20250421024102/https://helpcenter.pure.elsevier.com/en_US/data-sources-and-integrations/sunset-data-monitor.
Escamilla, Emily, Martin Klein, Talya Cooper, Vicky Rampin, Michele C. Weigle, and Michael L. Nelson. 2022. “The Rise of GitHub in Scholarly Publications.” In Linking Theory and Practice of Digital Libraries, edited by Gianmaria Silvello, Oscar Corcho, Paolo Manghi, et al. Springer Nature. https://doi.org/10.1007/978-3-031-16802-4_15.
Escamilla, Emily, Lamia Salsabil, Martin Klein, Jian Wu, Michele C. Weigle, and Michael L. Nelson. 2023. “It’s Not Just GitHub: Identifying Data and Software Sources Included in Publications.” In Linking Theory and Practice of Digital Libraries, edited by Omar Alonso, Helena Cousijn, Gianmaria Silvello, Mónica Marrero, Carla Teixeira Lopes, and Stefano Marchesin. Springer Nature. https://doi.org/10.1007/978-3-031-43849-3_17.
Färber, Michael. 2020. “Analyzing the GitHub Repositories of Research Papers.” JCDL ’20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 491–492. https://doi.org/10.1145/3383583.3398578.
Figshare. n.d. “Who We Work With.” Figshare. https://web.archive.org/web/20250531040121/https://info.figshare.com/working-with.
Franceschini, Fiorenzo, Domenico Maisano, and Luca Mastrogiacomo. 2016. “Empirical Analysis and Classification of Database Errors in Scopus and Web of Science.” Journal of Informetrics 10 (4): 933–953. https://doi.org/10.1016/j.joi.2016.07.003.
Gee, Bryan M. 2025a. “Code for: The Hunt for Research Data: Development of an Open-Source Workflow for Tracking Institutionally Affiliated Research Data Publications.” Software. Zenodo. https://doi.org/10.5281/zenodo.18037530.
Gee, Bryan M. 2025b. “Data for: The Hunt for Research Data: Development of an Open-Source Workflow for Tracking Institutionally Affiliated Research Data Publications.” Dataset. Texas Data Repository. https://doi.org/10.18738/T8/R9MSCP.
Gee, Bryan M. 2025c. “Supporting Appendices for: The Hunt for Research Data: Development of an Open-Source Workflow for Tracking Institutionally Affiliated Research Data Publications.” Text. Zenodo. https://doi.org/10.5281/zenodo.18039711.
GitHub. 2024. “Octoverse: AI Leads Python to Top Language as the Number of Global Developers Surges.” The GitHub Blog, October 29. https://web.archive.org/web/20251231051356/https://github.blog/news-insights/octoverse/octoverse-2024.
Goucher, Adam, Dave Hunt, David Burns, et al. 2025. Selenium: Official Python Bindings for Selenium WebDriver. Python. V. 4.33.0. Released May 23. https://www.selenium.dev.
Gould, Maria, and Daniella Lowenberg. 2019. “ROR-Ing Together: Implementing Organization IDs in Dryad.” Blog. Research Organization Registry (ROR), July 10. https://ror.org/blog/2019-07-10-ror-ing-together-with-dryad.
Gregory, Kathleen. 2020. “A Dataset Describing Data Discovery and Reuse Practices in Research.” Scientific Data 7 (1): 232. https://doi.org/10.1038/s41597-020-0569-5.
Gregory, Kathleen, Paul Groth, Andrea Scharnhorst, and Sally Wyatt. 2020. “Lost or Found? Discovering Data Needed for Research.” Harvard Data Science Review 2 (2): 1–51. https://doi.org/10.1162/99608f92.e38165eb.
Gualandi, Bianca, Luca Pareschi, and Silvio Peroni. 2022. “What Do We Mean by ‘Data’? A Proposed Classification of Data Types in the Arts and Humanities.” Journal of Documentation 79 (7): 51–71. https://doi.org/10.1108/JD-07-2022-0146.
Guy, Marieke, Martin Donnelly, and Laura Molloy. 2013. “Pinning It down: Towards a Practical Definition of ‘Research Data’ for Creative Arts Institutions.” International Journal of Digital Curation 8 (2): 99–110. https://doi.org/10.2218/ijdc.v8i2.275.
Habermann, Ted. 2023. “Improving Domain Repository Connectivity.” Data Intelligence 5 (1): 6–26. https://doi.org/10.1162/dint_a_00120.
Habermann, Ted. 2024. “Sustainable Connectivity in a Community Repository.” Data Intelligence 6 (2): 409–428. https://doi.org/10.1162/dint_a_00252.
Hamilton, Daniel G., Kyungwan Hong, Hannah Fraser, Anisa Rowhani-Farid, Fiona Fidler, and Matthew J. Page. 2023. “Prevalence and Predictors of Data and Code Sharing in the Medical and Health Sciences: Systematic Review with Meta-Analysis of Individual Participant Data.” BMJ 382 (July): e075767. https://doi.org/10.1136/bmj-2023-075767.
Hamilton, Daniel G., Matthew J. Page, Sue Finch, Sarah Everitt, and Fiona Fidler. 2022. “How Often Do Cancer Researchers Make Their Data and Code Available and What Factors Are Associated with Sharing?” BMC Medicine 20 (1): 438. https://doi.org/10.1186/s12916-022-02644-2.
Hendricks, Ginny, Dominika Tkaczyk, Jennifer Lin, and Patricia Feeney. 2020. “Crossref: The Sustainable Source of Community-Owned Scholarly Metadata.” Quantitative Science Studies 1 (1): 414–427. https://doi.org/10.1162/qss_a_00022.
Hofelich Mohr, Alicia, and Mikala Narlock. 2024. “DataCurationNetwork/Rads-Metadata: Article Acceptance.” Dataset. Zenodo. https://doi.org/10.5281/zenodo.11073357.
Holcombe, Alex O. 2019. “Contributorship, Not Authorship: Use CRediT to Indicate Who Did What.” Publications 7 (3): 48. https://doi.org/10.3390/publications7030048.
Howison, James, and Julia Bullard. 2016. “Software in the Scientific Literature: Problems with Seeing, Finding, and Using Software Mentioned in the Biology Literature.” Journal of the Association for Information Science and Technology 67 (9): 2137–2155. https://doi.org/10.1002/asi.23538.
Istrate, Ana-Maria, Joshua Fisher, Xinyu Yang, et al. 2024. “Scientific Software Citation Intent Classification Using Large Language Models.” In Natural Scientific Language Processing and Research Knowledge Graphs: First International Workshop, NSLP 2024, Hersonissos, Crete, Greece, May 27, 2024, Proceedings, 80–99. https://doi.org/10.1007/978-3-031-65794-8_6.
Istrate, Ana-Maria, Donghui Li, Dario Taraborelli, Michaela Torkar, Boris Veytsman, and Ivana Williams. 2022. “A Large Dataset of Software Mentions in the Biomedical Literature.” Preprint, arXiv, September 27. https://doi.org/10.48550/arXiv.2209.00693.
Johnston, Lisa R., Alicia Hofelich Mohr, Joel Herndon, et al. 2024. “Seek and You May (Not) Find: A Multi-Institutional Analysis of Where Research Data Are Shared.” PLOS ONE 19 (4): e0302426. https://doi.org/10.1371/journal.pone.0302426.
Kambouris, Steven, David P. Wilkinson, Eden T. Smith, and Fiona Fidler. 2024. “Computationally Reproducing Results from Meta-Analyses in Ecology and Evolutionary Biology Using Shared Code and Data.” PLOS ONE 19 (3): e0300333. https://doi.org/10.1371/journal.pone.0300333.
Karcher, Sebastian, Dessislava Kirilova, and Nicholas Weber. 2016. “Beyond the Matrix: Repository Services for Qualitative Data.” IFLA Journal 42 (4): 292–302. https://doi.org/10.1177/0340035216672870.
Lafia, Sara, and Werner Kuhn. 2018. “Spatial Discovery of Linked Research Datasets and Documents at a Spatially Enabled Research Library.” Journal of Map & Geography Libraries 14 (1): 21–39. https://doi.org/10.1080/15420353.2018.1478923.
Lichtenberg, Elinor M., Chase D. Mendenhall, and Berry Brosi. 2017. “Dataset Supplementing Lichtenberg et al. (2017) Foraging Traits Modulate Stingless Bee Community Disassembly under Forest Loss. Journal of Animal Ecology.” Dataset. Zenodo. https://doi.org/10.5281/zenodo.843615.
Lin, Jennifer. 2018. “Peer Review Publications.” Blog. Crossref Blog, August 12. https://doi.org/10.64000/gp78m-kkk93.
Löffler, Felicitas, Valentin Wesp, Birgitta König-Ries, and Friederike Klan. 2021. “Dataset Search in Biodiversity Research: Do Metadata in Data Repositories Reflect Scholarly Information Needs?” PLOS ONE 16 (3): e0246099. https://doi.org/10.1371/journal.pone.0246099.
Long-Fox, Brooke, Ana Andruchow-Colombo, Shreya Jariwala, Maureen O’Leary, and Tanya Berardini. 2024. “Addressing Global Biodiversity Challenges: Ensuring Long-Term Sustainability of Morphological Data Collection and Reuse through MorphoBank.” Conference Abstract. Biodiversity Information Science and Standards (Sofia, Bulgaria) 8: 529–537. https://doi.org/10.3897/biss.8.135124.
Lostner, Samantha, and Ali Krzton. 2025. “A Multi-Strategic Approach to Locating Institutional Data Deposits.” Presentation. Research Data Access and Preservation (RDAP) Annual Meeting, Virtual. OSF, March 12. https://osf.io/s4yez.
Lowenberg, Daniella. 2021. “Doing It Right: A Better Approach for Software & Data.” Blog. Dryad News, February 8. https://web.archive.org/web/20251115004254/https://blog.datadryad.org/2021/02/08/doing-it-right-a-better-approach-for-software-amp-data.
Madrid, Raúl. 2016. “The Rise of Ethnic Politics in Latin America.” Dataset. Qualitative Data Repository, August 19. https://doi.org/10.5064/F6MS3QNV.
Maitner, Brian, Paul Efren Santos Andrade, Luna Lei, et al. 2024. “Code Sharing in Ecology and Evolution Increases Citation Rates but Remains Uncommon.” Ecology and Evolution 14 (8): e70030. https://doi.org/10.1002/ece3.70030.
Mannheimer, Sara, Jason A. Clark, Kyle Hagerman, Jakob Schultz, and James Espeland. 2021. “Dataset Search: A Lightweight, Community-Built Tool to Support Research Data Discovery.” Journal of eScience Librarianship 10 (1): 1189. https://doi.org/10.7191/jeslib.2021.1189.
McKinney, Wes. 2011. “Pandas: A Foundational Python Library for Data Analysis and Statistics.” Python for High Performance and Scientific Computing 14: 1–9.
Mongeon, Philippe, and Adèle Paul-Hus. 2016. “The Journal Coverage of Web of Science and Scopus: A Comparative Analysis.” Scientometrics 106 (1): 213–228. https://doi.org/10.1007/s11192-015-1765-5.
National Center for Biotechnology Information. n.d. “NCBI Website and Data Usage Policies and Disclaimers.” National Library of Medicine. Accessed June 26, 2025. https://www.ncbi.nlm.nih.gov/home/about/policies.
O’Leary, Maureen A., and Seth Kaufman. 2011. “MorphoBank: Phylophenomics in the ‘Cloud.’” Cladistics 27 (5): 529–537. https://doi.org/10.1111/j.1096-0031.2011.00355.x.
Oliphant, Timothy E. 2006. Guide to NumPy. Vol. 1. Trelgol Publications. https://ecs.wgtn.ac.nz/foswiki/pub/Support/ManualPagesAndDocumentation/numpybook.pdf.
Pan, Huitong, Qi Zhang, Eduard Dragut, Cornelia Caragea, and Longin Jan Latecki. 2023. “DMDD: A Large-Scale Dataset for Dataset Mentions Detection.” Transactions of the Association for Computational Linguistics 11: 1132–1146. https://doi.org/10.1162/tacl_a_00592.
Parker, William G., Sterling J. Nesbitt, Randall B. Irmis, et al. 2022. “Osteology, Histology, and Relationships of Revueltosaurus callenderi (Project).” Dataset. MorphoBank. https://doi.org/10.7934/P620.
Pranckutė, Raminta. 2021. “Web of Science (WoS) and Scopus: The Titans of Bibliographic Information in Today’s Academic World.” Publications 9 (1): 12. https://doi.org/10.3390/publications9010012.
QGIS.org. 2025. QGIS Geographic Information System. QGIS Association. https://qgis.org.
Reitz, Kenneth. 2025. Requests: Python HTTP for Humans. Python. V. 2.32.4. Released June 9. https://pypi.org/project/requests.
Ross, Cody Cooper (@codycooperross). 2025. “DataCite Release Notes – December 2025.” GitHub, December 18. https://web.archive.org/web/20260109012514/https://github.com/datacite/datacite-suggestions/discussions/217.
Schindler, David, Felix Bensmann, Stefan Dietze, and Frank Krüger. 2022. “The Role of Software in Science: A Knowledge Graph-Based Analysis of Software Mentions in PubMed Central.” PeerJ Computer Science 8: e835. https://doi.org/10.7717/peerj-cs.835.
Sharma, Nitesh Kumar, Ram Ayyala, Dhrithi Deshpande, et al. 2024. “Analytical Code Sharing Practices in Biomedical Research.” PeerJ Computer Science 10: e2066. https://doi.org/10.7717/peerj-cs.2066.
Sheridan, Helenmary, Anthony J. Dellureficio, Melissa A. Ratajeski, Sara Mannheimer, and Terrie R. Wheeler. 2021. “Data Curation Through Catalogs: A Repository-Independent Model for Data Discovery.” Journal of eScience Librarianship 10 (3): 1203. https://doi.org/10.7191/jeslib.2021.1203.
Sostek, Katrina, Daniel M. Russell, Nitesh Goyal, Tarfah Alrashed, Stella Dugall, and Natasha Noy. 2024. “Discovering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search.” Harvard Data Science Review Special Issue 4. https://doi.org/10.1162/99608f92.4c3e11ca.
Staller, Amanda, Julian Gautier, Sarah Lippincott, et al. 2023. “GREI Collaborative Webinar: Use Cases in Generalist Repositories and Community Feedback.” Presentation. Virtual, August 1. https://doi.org/10.5281/zenodo.8208834.
Stevenson, James, Kori Kuzma, and Reece Hart. 2019. Eutils: Python Interface to NCBI’s Eutilities API. Python. V. 0.6.0. https://github.com/biocommons/eutils.
Strecker, Dorothea. 2025. “How Permanent Are Metadata for Research Data? Understanding Changes in DataCite DOI Metadata.” Quantitative Science Studies (Online Early): 1–24. https://doi.org/10.1162/QSS.a.407.
Sun, Guangyuan, Tanja Friedrich, Kathleen Gregory, and Brigitte Mathiak. 2024. “Supporting Data Discovery: Comparing Perspectives of Support Specialists and Researchers.” Data Science Journal 23 (48): 1–17. https://doi.org/10.5334/dsj-2024-048.
Tenopir, Carol, Suzie Allard, Kimberly Douglass, et al. 2011. “Data Sharing by Scientists: Practices and Perceptions.” PLOS ONE 6 (6): e21101. https://doi.org/10.1371/journal.pone.0021101.
Van Gulick, Ana, Gretchen Gueguen, and Julie Goldman. 2024. “Generalist Repository Metadata Schema 2.0 – Community Feedback.” Poster. Research Data Access and Preservation (RDAP) Annual Meeting, Virtual. OSF, March 13. https://osf.io/uhygv.
Van Wettere, Niek. 2021. “Affiliation Information in DataCite Dataset Metadata: A Flemish Case Study.” Data Science Journal 20 (13): 1–18. https://doi.org/10.5334/dsj-2021-013.
Vera-Baceta, Miguel-Angel, Michael Thelwall, and Kayvan Kousha. 2019. “Web of Science and Scopus Language Coverage.” Scientometrics 121 (3): 1803–1813. https://doi.org/10.1007/s11192-019-03264-z.
Vieira, Elizabeth S., and José A. N. F. Gomes. 2009. “A Comparison of Scopus and Web of Science for a Typical University.” Scientometrics 81 (2): 587–600. https://doi.org/10.1007/s11192-009-2178-0.
Visser, Martijn, Nees Jan van Eck, and Ludo Waltman. 2021. “Large-Scale Comparison of Bibliographic Data Sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic.” Quantitative Science Studies 2 (1): 20–41. https://doi.org/10.1162/qss_a_00112.
Wallis, Jillian C., Stasa Milojevic, Christine L. Borgman, and William A. Sandoval. 2006. “The Special Case of Scientific Data Sharing with Education.” Proceedings of the American Society for Information Science and Technology 43 (1): 1–13. https://doi.org/10.1002/meet.14504301169.
Wang, Ben. 2024. “Patterns and Disparities in Research Data Sharing: An Analysis of Researchers’ Data Sharing Behaviors at University of Rochester.” Poster. Research Data Access and Preservation (RDAP) Annual Meeting, Virtual. OSF, March 13. https://osf.io/na4e8.
Warner, Claire. 2025. “Developing a Dataset Catalog for the University of Alabama at Birmingham.” Presentation. Research Data Access and Preservation (RDAP) Annual Meeting, Virtual. OSF, March 12. https://osf.io/4b59q.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 160018. https://doi.org/10.1038/sdata.2016.18.
Wink, Isaac. 2024. “Investigating Researcher Data Sharing Practices Using the DataCite API.” Presentation. All Things Open, Virtual, April 15. https://digitalcommons.kennesaw.edu/ato/2024allthingsopen/presentations/7.
Wu, Mingfang, Fotis Psomopoulos, Siri Jodha Khalsa, and Anita de Waard. 2019. “Data Discovery Paradigms: User Requirements and Recommendations for Data Repositories.” Data Science Journal 18 (3): 1–13. https://doi.org/10.5334/dsj-2019-003.
Zhao, Mengnan, Erjia Yan, and Kai Li. 2018. “Data Set Mentions and Citations: A Content Analysis of Full-Text Publications.” Journal of the Association for Information Science and Technology 69 (1): 32–46. https://doi.org/10.1002/asi.23919.
Zhu, Junwen, and Weishu Liu. 2020. “A Tale of Two Databases: The Use of Web of Science and Scopus in Academic Papers.” Scientometrics 123 (1): 321–335. https://doi.org/10.1007/s11192-020-03387-8.