Journal of eScience Librarianship

Common Errors in Ecological Data Sharing

Objectives: (1) to identify common errors in data organization and metadata completeness that would preclude a “reader” from being able to interpret and re-use the data for a new purpose; and (2) to develop a set of best practices derived from these common errors that would guide researchers in creating more usable data products that could be readily shared, interpreted, and used.
Methods: We used directed qualitative content analysis to assess and categorize data and metadata errors identified by peer reviewers of data papers published in the Ecological Society of America’s (ESA) Ecological Archives. Descriptive statistics provided the relative frequency of the errors identified during the peer review process.
Results: There were seven overarching error categories: Collection & Organization, Assure, Description, Preserve, Discover, Integrate, and Analyze/Visualize. These categories represent errors researchers regularly make at each stage of the Data Life Cycle. Collection & Organization and Description errors were among the most common; each occurred in over 90% of the papers.
Conclusions: Publishing data for sharing and reuse is error prone, and each stage of the Data Life Cycle presents opportunities for mistakes. The most common errors occurred when the researcher did not provide adequate metadata to enable others to interpret and potentially re-use the data. Fortunately, these mistakes can be minimized by carefully recording all details about study context, data collection, QA/QC, and analytical procedures from the beginning of a research project and then including this descriptive information in the metadata.


Introduction
Data are increasingly being recognized as important products of the scientific enterprise (U.S. GAO 2007; OSTP 2013), including by funding agencies such as the U.S. National Institutes of Health and the U.S. National Science Foundation. Both agencies now require that proposals include plans describing how data will be shared and managed (NIH 2003; NSF 2011). Similarly, professional societies and [...] for the data as well as how the data were generated, organized, quality assured, and preserved (Michener et al. 1997).
The process of publishing data and metadata is relatively new to scientists in many domains. The Ecological Society of America's (ESA) data papers represent a unique type of article that the ESA has published since 2005. ESA's Ecology publishes the abstract describing the data paper, and Ecological Archives publishes the comprehensive data sets and accompanying metadata that describe the content, context, quality, and structure of the data. Ecological Archives provides long-term access to data papers, which authors are encouraged to update periodically to facilitate secondary data use and analysis. Data papers undergo extensive peer review to assess the submission's overall quality and significance to the ecological sciences, as well as additional technical review of the data and metadata to ensure a high standard of usability.
The overall goal of this paper was to provide a detailed case study of common errors observed when researchers prepare data and documentation for sharing and archiving. The findings were derived from ecology but are applicable to other research disciplines that require data management for long-term archiving, as well as to libraries, data librarians, and archivists who may play a role in supporting researchers. This study analyzed peer reviews of 53 ESA data papers published in Ecology and Ecological Archives between August 2005 and May 2012. The principal objectives of this study were: (1) to identify common errors in data organization and metadata completeness that would preclude a "reader" from being able to interpret and re-use the data for a new purpose (e.g., study repeatability, synthesis, or meta-analysis); and (2) to develop a set of best practices derived from these common errors that would guide researchers across disciplines in creating more usable data products that could be readily shared, interpreted, and used.

Data Collection
Data papers included both the data files and associated metadata that the author(s) submitted to the Ecological Archives. Data paper authors were required to follow the Ecological Archives metadata content standard, which is based on the format described in Michener et al. (1997) and includes a comprehensive list of elements that, if adequately described in the data paper, should be sufficient to allow researchers unfamiliar with the data set to effectively interpret and reuse the data.
Two or more peer reviewers whom the editor of Ecological Archives considered subject-matter experts in the topic of the paper reviewed each submission. Peer reviewers focused on four aspects of the paper (ESA Archives 2012): "1. Importance and interest to Ecological Archives' users and readers. 2. Scientific and technical soundness of the database. 3. Originality. 4. Degree to which metadata fully describe the content, context, quality, and structure of the database." Reviewers were encouraged to specifically comment on: metadata presentation and completeness; data organization, quality, and integrity; methods; study design; errors; and citations. The editor for Ecological Archives evaluated the reviews and decided whether to accept the data paper or allow for resubmission after the author(s) addressed the reviewers' comments. Revised data papers were further evaluated by the editor and published if the revisions were deemed suitable in responding to the reviewers' comments.
Ecology and Ecological Archives published all 53 data papers used in this analysis after requested revisions were completed, including satisfactorily addressing all issues identified by the reviewers (Table 1). A total of 104 peer reviews of all published data papers provided the data that were analyzed for this paper (Table 1). Peer-review comments of rejected data papers were not available for analysis; in most such cases, the editor rejected the data papers as inappropriate for Ecological Archives, and the data papers were not sent to peer reviewers.
The number of data papers submitted generally increased over time (see Table 1). Researchers submitted few data papers during the first few years. The number of data papers submitted increased in the fourth year, with the peak in 2011, the last full year analyzed.

Data Analysis
Directed qualitative content analysis (Zhang & Wildemuth 2009) was used to assess data and metadata errors identified by peer reviewers of papers published in the Ecological Archives. Descriptive statistics provided the relative frequency of the various errors identified during the peer-review process.
Analysis of the data paper reviews consisted of qualitative coding of errors followed by quantitative analysis of those codes. First, five data papers were selected at random, and the errors identified by their reviewers were listed.
Second, those errors were grouped into the Data Life Cycle elements described by Michener and Jones (2012): (1) Collection & Organization; (2) Assure (including quality assurance and quality control); (3) Description (i.e., ascribing metadata to the data); (4) Preserve; (5) Discover; (6) Integrate; and (7) Analyze/Visualize. While the Data Life Cycle also includes an eighth element (Planning), these types of errors were not apparent in the reviewers' comments. Third, errors were assigned to more detailed categories based on the metadata elements identified by Michener et al. (1997). For a complete list of error categories, see Appendix A. Finally, the reviews of the remaining 48 data papers were analyzed by categorizing the reviewer-identified errors into individual categories. When necessary, new categories were created.
To maintain consistency, a single researcher (Author #1) performed all initial coding and classification of reviewer-identified errors.
Occasionally, multiple reviewers pointed out the same error in a data paper. When this occurred, the error was counted once in the quantitative tally of errors performed later in the analysis process. This process resulted in the identification of more than 100 detailed categories, many of which were closely related. To narrow this down, overlapping categories and categories that contained only one or two identified errors were combined as appropriate, which resulted in 60 detailed error categories. Finally, the detailed error categories were grouped into Error Classes under the Data Life Cycle elements, where appropriate. After this process was complete, each overarching Data Life Cycle Element category had zero to four major error classes.
Quantitative analysis consisted of generating descriptive statistics related to each of the Data Life Cycle elements and error classes. Initially, this entailed noting the total number of errors for each detailed category and Data Life Cycle element. This information was used to calculate the mean number of errors for the overarching Data Life Cycle elements, error classes, and detailed error categories. In the next step, the number of papers with each detailed error category was calculated, as well as the percent of papers with each error. Finally, the number of errors for each paper in the overarching Data Life Cycle elements and error classes was tallied. This allowed the calculation of the median number of errors for each category, as well as the mean and median number of errors per paper. Results of this quantitative analysis are presented below.

Results
Reviewers identified an average of 20.3 errors per data paper. The numbers of errors identified by reviewers varied yearly and there were no consistent long-term trends.
Most data papers (49 out of 53; 92.5%) had errors associated with Collection and Organization (Figure 2). On average, each data paper had 7.8 Collection and Organization errors (Figure 3). The most common Collection and Organization errors were in the description of collection methods (38; 71.7%); not adequately describing the data collection site or time frame (29; 54.7%); and omitting relevant variables that were important for future analysis of the data set (43.4%). Nearly half (26; 49.1%) of the papers had an error in the description of the data collection protocol, including errors of omission, such as neglecting to explain how long samples, such as water or soil samples, were stored before analysis. Errors in the description of the data collection site included not describing how the site was determined or subdivided, including whether critical points of plots, such as edges or center points, were clearly marked.
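The tallying procedure described under Data Analysis can be illustrated with a short Python sketch. It is not the authors' actual workflow: the record layout, the paper and reviewer identifiers, and the function names are all hypothetical. It also assumes that papers without a given kind of error count as zero when means and medians are computed, which the text does not state explicitly.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical coded errors: (paper, reviewer, life-cycle element, detailed category, note).
CODED_ERRORS = [
    ("P01", "R1", "Description", "grammatical error", "missing space after period"),
    ("P01", "R2", "Description", "grammatical error", "missing space after period"),
    ("P01", "R1", "Collection & Organization", "collection methods", "storage time not given"),
    ("P02", "R1", "Description", "undefined jargon", "'degree of fragmentation'"),
    ("P03", "R2", "Assure", "QA/QC not described", "no range checks reported"),
]
ALL_PAPERS = ["P01", "P02", "P03", "P04"]  # hypothetical; P04 has no coded errors


def percent_of_papers(papers_with_error, total_papers):
    """Percent of papers showing an error, rounded to one decimal place."""
    return round(100 * papers_with_error / total_papers, 1)


def deduplicate(coded_errors):
    """Count an error once per paper, even when several reviewers reported it."""
    return {(paper, element, category, note)
            for paper, _reviewer, element, category, note in coded_errors}


def summarize_by_element(coded_errors, all_papers):
    """Per-element paper counts, percentages, and mean/median errors per paper."""
    errors_per_paper = defaultdict(lambda: defaultdict(int))
    for paper, element, _category, _note in deduplicate(coded_errors):
        errors_per_paper[element][paper] += 1

    summary = {}
    for element, counts in errors_per_paper.items():
        # Assumption: papers without this kind of error contribute a zero.
        per_paper = [counts.get(paper, 0) for paper in all_papers]
        summary[element] = {
            "papers": len(counts),
            "percent": percent_of_papers(len(counts), len(all_papers)),
            "mean": mean(per_paper),
            "median": median(per_paper),
        }
    return summary


if __name__ == "__main__":
    for element, stats in sorted(summarize_by_element(CODED_ERRORS, ALL_PAPERS).items()):
        print(element, stats)
```

Applied to the counts reported here, `percent_of_papers(49, 53)` returns 92.5, matching the share of papers with Collection and Organization errors.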
Over half of the papers analyzed (28; 52.8%) had errors in the description of Quality Assurance/Quality Control (QA/QC) procedures (Figure 2), with an average of 1.2 errors per paper (Figure 3). Nearly one third of the papers (32.1%) did not adequately describe their QA/QC procedures. Errors ranged from neglecting to provide basic statistics regarding the data, such as ranges or mean values, to incomplete descriptions of logical consistency checks or benchmarks used to verify the accuracy of the data.
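The kinds of QA/QC information that reviewers found missing (basic statistics such as ranges and means, and logical consistency checks) are straightforward to compute and report. The following Python sketch shows one possible form; the sample records, field names, and the plausible temperature range are invented for illustration and do not come from any data paper in the study.

```python
from datetime import date
from statistics import mean

# Hypothetical water-sample records; all values are invented.
SAMPLES = [
    {"id": "S1", "temp_c": 12.4, "collected": date(2010, 6, 1), "analyzed": date(2010, 6, 3)},
    {"id": "S2", "temp_c": 11.9, "collected": date(2010, 6, 1), "analyzed": date(2010, 5, 30)},
    {"id": "S3", "temp_c": 99.0, "collected": date(2010, 6, 2), "analyzed": date(2010, 6, 4)},
]

# Assumed plausible range for this hypothetical variable.
VALID_TEMP_RANGE_C = (-5.0, 40.0)


def basic_statistics(rows, field):
    """Minimum, maximum, and mean of a numeric field, for reporting in metadata."""
    values = [row[field] for row in rows]
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def range_violations(rows, field, low, high):
    """IDs of records whose value falls outside the documented valid range."""
    return [row["id"] for row in rows if not (low <= row[field] <= high)]


def date_order_violations(rows):
    """Logical consistency check: a sample cannot be analyzed before it is collected."""
    return [row["id"] for row in rows if row["analyzed"] < row["collected"]]


if __name__ == "__main__":
    print(basic_statistics(SAMPLES, "temp_c"))
    print(range_violations(SAMPLES, "temp_c", *VALID_TEMP_RANGE_C))
    print(date_order_violations(SAMPLES))
```

Reporting both the checks that were run and the records they flagged gives a data re-user a concrete basis for judging data quality.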
The most common Data Life Cycle Element errors were Description errors (51; 96.2%) (Figure 2), and data papers contained an average of 9.3 Description errors (Figure 3). Many such errors (83.0%) were simple editing errors, including grammatical errors (44; 58.5%) that ranged from awkward sentence structure or wordiness to simple mistakes that an automatic grammar check would catch, such as missing spaces after a period. Errors in descriptive metadata were also very common (39; 73.6%), and many researchers (24; 45.3%) had a tendency to use either vague terms, such as "moderate" or "extreme," or field jargon, such as "degree of fragmentation," without clearly defining those terms. Finally, 43.4% (23) of the papers did not adequately describe the overall research project, such as not providing the background information required to get a clear understanding of the scientific context or questions that framed the study.
Reviewers of data papers noted errors related to the long-term preservation and storage of the submitted data sets in about one in five papers (12; 22.6%) (Figure 2). For example, an author may have mentioned that he or she kept all original data and records in personal offices or computers, or was storing data in proprietary or non-archival formats. Authors might not provide details regarding the maintenance of the data set, in cases of data sets archived over extended periods.
Over half of the papers (28; 52.8%) had Discover errors that would affect the ability to discover a particular data set and to assess the data set's utility (Figure 2). Despite the large number of papers with Discover errors, the average number of Discover errors per paper was much lower, with only 1.2 errors per paper (Figure 3). The most common Discover errors were insufficient description of access or use constraints (7; 13.2%); insufficient description of the data set's contributions and limitations (18; 34.0%); and not including information that would make finding the data set easier for potential data reusers (11; 20.8%). Of this last category, 17.0% were a result of authors not including all relevant information in the abstract, such as not including the years of data collection or not summarizing the data collection methods.
About six percent (3; 5.7%) of these papers had errors in the integration of data sets (Table 2). Each data paper that had an error of this type failed to properly cite the sources of data that went into the integrated data set. For example, one data paper provided climatic data to supplement the data collected, but neglected to acknowledge the source of the climatic data. Another data paper did not use the most current version of the referenced data source.
While most data papers presented raw data sets, numerous papers included some analysis of the data. Seventeen percent (9) of the data papers analyzed had some type of error in the presentation of the analysis or visualization results (Table 2). These errors included neglecting to include statistical significance of the analysis results, not including all relevant variables, and not explaining how the data changed during the analysis process. Of the nine papers that had Analyze/Visualize errors, seven authors did not sufficiently describe their analysis methods, such as not documenting formulas used to create new variables or data sets.

Data collection and organization best practices
Researchers could avoid many errors by taking detailed notes before, during, and after the data collection process. This starts with describing the study and the goals for the study. Authors of nearly half the papers analyzed (43.4%) did not sufficiently describe the project background, goals, or research questions. This information is essential, since it describes the larger research project and provides the scientific context that shapes the decisions made regarding data collection and analysis (Strasser et al. 2012).
Contextual information includes the spatial location of the data collection site, the time frame when data collection occurred, and environmental factors that could affect the observations and subsequent interpretation of the data. Photos, maps, and GPS coordinates of the data collection site are critical to data reuse, particularly if future researchers choose to resample the area, since many sites are changing due to natural or human causes.
Metadata associated with most (71.7%) data papers lacked sufficient detail about the data collection process and methods, including experimental manipulations, measurements, and sampling choices made during the data collection process. Information about sampling designs, research methods, and identification of project personnel is central to interpreting and using data (Michener et al. 1997).

Data quality assurance and control best practices
Metadata from most data papers (52.8%) did not describe quality assurance and quality control (QA/QC) procedures in detail. Detailed descriptions of QA/QC procedures are critical for those looking to determine fitness for use.

Discussion
Data are an important product of research. Data that are to be re-used in the future require careful preparation of metadata and documentation that allow future users to find and understand them. In this case study, common errors observed from reviews of Ecological Archives were compiled and described; these errors serve as the basis for informing data documentation. Despite any limitations associated with focusing on ecological data, many of the errors identified are representative of those occurring in other fields of research. The analysis of the cause of these errors, along with existing data management practices (Michener et al. 1997; Cook et al. 2001; Borer et al. 2009; Hook et al. 2010), provides examples across research disciplines for data documentation and preparation.
Even if researchers intend to publish a complete set of their data in a repository, original data that are stored only on a personal computer or in an office file cabinet are especially prone to loss through accidents (Michener et al. 1997).

Data discovery best practices
When researchers publish data in a repository, the ancillary information included with that data is essential for other researchers who need to find the data later. This entails including all relevant information in the abstract and listing all keywords that could describe the data.

Data integration and analysis best practices
Clear descriptions of all data integration and analysis steps, including any software used to process the data, are just as important for those reusing data as understanding the methods used to collect the data in the first place. Documenting any changes to the data set is a key part of maintaining data provenance (Strasser et al. 2012) and is critical to enabling data re-users to assess the confidence that they can place in the data (Chapman & Jagadish 2007). One way to document provenance is to include scientific workflows and code from scripts written in languages such as R in the metadata, since they can provide a record of changes made to the data (Borer et al. 2009).
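As a rough illustration of how a script can double as a provenance record, the Python sketch below logs every transformation applied to a small data set, including the formula used to derive a new variable. The `ProvenanceLog` class, the field names, and the sample values are hypothetical; Python is used here only for illustration, and the same idea applies to scripts in R or another language.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    """Ordered record of every change applied to a data set (hypothetical helper)."""
    steps: list = field(default_factory=list)

    def apply(self, description, transform, rows):
        """Run a transformation on the rows and record what it did."""
        result = transform(rows)
        self.steps.append({
            "step": len(self.steps) + 1,
            "description": description,
            "rows_in": len(rows),
            "rows_out": len(result),
        })
        return result


def drop_missing(rows):
    """Remove records with no temperature reading."""
    return [row for row in rows if row["temp_f"] is not None]


def fahrenheit_to_celsius(rows):
    """Derive temp_c from temp_f using C = (F - 32) * 5 / 9; input rows are not modified."""
    return [{**row, "temp_c": round((row["temp_f"] - 32) * 5 / 9, 2)} for row in rows]


# Invented sample data.
raw = [
    {"id": "S1", "temp_f": 50.0},
    {"id": "S2", "temp_f": None},
    {"id": "S3", "temp_f": 68.0},
]

log = ProvenanceLog()
cleaned = log.apply("Dropped records with missing temperature", drop_missing, raw)
converted = log.apply("Added temp_c = (temp_f - 32) * 5 / 9", fahrenheit_to_celsius, cleaned)

if __name__ == "__main__":
    for step in log.steps:
        print(step)
```

The contents of `log.steps` could then be copied into the metadata so that re-users can see which records were removed and how each derived variable was calculated.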

Metadata and description best practices
If researchers have taken complete notes throughout the data gathering and analysis process, then the task of creating metadata becomes much easier. Researchers can save time and effort by carefully considering what metadata will be necessary at the outset of the project, rather than trying to correctly recall important details after the data collection and analysis process is complete. For instance, creating and maintaining a data dictionary is easiest when done at the inception of the study, instead of waiting until the data are ready to publish. Similarly, fully defining all codes and variables, including measurement method, units of measurement, as well as field site names (if appropriate) is easiest when done at the time of data collection and analysis. When writing metadata, another helpful guideline for researchers to follow is to use the same rules they would use when writing a paper for publication. This includes defining any jargon or acronyms and running the descriptive metadata through a grammar and spell-checking tool prior to submission.
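A data dictionary of the kind described above can be kept in machine-readable form and checked against the data files as they are created. The following Python sketch is one possible arrangement, not a prescribed format: the variable names, site names, cover classes, and helper functions are all invented for illustration.

```python
# Hypothetical data dictionary: every variable has a description, units (if any),
# and a definition for each code it may take.
DATA_DICTIONARY = {
    "site": {
        "description": "Field site name",
        "units": None,
        "codes": {"NF": "North Forest plot", "SM": "South Meadow plot"},
    },
    "temp_c": {
        "description": "Water temperature measured with a handheld probe",
        "units": "degrees Celsius",
        "codes": None,
    },
    "cover": {
        "description": "Canopy cover class",
        "units": None,
        "codes": {"L": "low (<25%)", "M": "moderate (25-75%)", "H": "high (>75%)"},
    },
}


def undocumented_columns(header, dictionary):
    """Columns that appear in a data file but are not defined in the dictionary."""
    return [column for column in header if column not in dictionary]


def undefined_codes(rows, dictionary):
    """(row index, column, value) for every coded value that lacks a definition."""
    problems = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            entry = dictionary.get(column)
            if entry and entry["codes"] is not None and value not in entry["codes"]:
                problems.append((index, column, value))
    return problems


if __name__ == "__main__":
    print(undocumented_columns(["site", "temp_c", "depth"], DATA_DICTIONARY))
    print(undefined_codes([{"site": "XX", "cover": "extreme"}], DATA_DICTIONARY))
```

Giving a term such as "moderate" an explicit numeric definition in the dictionary addresses the vague-term errors that reviewers frequently flagged.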

Data preservation best practices
Preserving data for future research requires careful consideration. One important aspect of this is to save data in a non-proprietary format, such as ASCII text, while avoiding saving data in non-extractable formats, such as PDF. Storing data in multiple places helps to protect data from accidental loss.
Libraries and librarians may be in a prime position to directly or indirectly support data management education for faculty and students (Treloar et al. 2012). Existing training and education tools as well as the best practices documented herein can provide the foundation for improving data stewardship in the sciences.

Conclusions
This paper provides a case study of common and representative errors observed when researchers prepare data and documentation for sharing and archiving. The findings were derived from Ecological Archives but are also applicable to other research disciplines that require data management for long-term archiving.
One objective of this paper was to identify common errors in data organization and metadata completeness that would preclude a "reader" from being able to interpret and re-use the data. Publishing data for sharing and reuse is error-prone, and each stage of the data life cycle presents opportunities for mistakes. In the data collection stage, researchers failed to describe their methods, the data collection site, or the context in which the samples were collected. Errors in the QA/QC stage of the life cycle occurred when researchers did not describe validation procedures, either during data collection or data entry. The most common errors are those where the researcher did not provide metadata that was adequate to enable others to interpret and potentially re-use the data.
The second objective was to use these common errors to develop a set of best practices for data management that would guide researchers across disciplines in creating more usable data products. A set of recommendations for best practices for data publication, summarized by elements of the data life cycle, is presented to enable researchers from many disciplines to create data products that are easier to share and re-use.

Figure 1: Average number of errors per paper by year by Data Life Cycle Element Category.

Figure 2: Percent of data papers with errors in each Data Life Cycle Element Category. Each paper may have errors in multiple Data Life Cycle Categories.

Figure 3: Mean number of errors in a given Data Life Cycle Element Category.

Table 1: Number of data papers published by year, including total number of reviews per year and average number of reviews per data paper. Papers evaluated in 2012 represent a partial year.

Table 2: Descriptive statistics for the most common error categories for each stage of the Data Life Cycle. These will not sum up to 100%, since each data paper may have multiple errors in any given Data Life Cycle category. The total number of papers analyzed was 53.