Best Practices for Data Sharing and Archiving

Why Should I Share Data and Code?

  1. Some journal publishers (e.g., Nature) and funding agencies (e.g., the National Institutes of Health and the National Science Foundation) require it. Based on a recently released Office of Science and Technology Policy memo, other funding agencies are also expected to require researchers to share data produced during their funded research.
  2. Data can be reused to answer new questions or find new interpretations and discoveries.
  3. Sharing data may lead to sharing research processes, workflows, and tools, which enhances the potential to replicate research results.
  4. Sharing data makes your publications more useful and citable by others, which increases their value.

How and Where to Share Your Data and Code

  1. Deposit into a recognized data repository.
  2. If no suitable NIH, domain, generalist, or other repository exists, deposit your data in the Weill Cornell Institutional Data Repository for Research (WIDRR). Note that WIDRR is an archive without sharing functions.
  3. When submitting an article, submit your data/code along with your manuscript to the journal’s designated repository.
  4. When sharing your data and code, bundle your data together in a systematic way by following best practices for organizing files and structuring code so others can easily understand and use the data and/or code. 
  5. When sharing your data and code, include enough information so that others can understand and reuse the dataset. See the Data Documentation and Metadata page for more information.
  6. Follow best practices to include enough information in the readme files or elsewhere to make it possible to cite the dataset. See how to cite data sources.
  7. Follow best practices to ensure confidentiality of any human participants.
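
As an illustration of items 5 and 6, the following Python sketch assembles a plain-text readme that lists the files in a bundle and includes a suggested citation. The field names, the citation layout, and the example values are assumptions made for this sketch, not a format required by WIDRR or any particular repository; check your repository's own guidance before depositing.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatasetInfo:
    """Hypothetical minimal description of a dataset to be deposited."""
    title: str
    creators: List[str]
    year: int
    repository: str
    identifier: str  # e.g., the DOI assigned by the repository
    version: str = "1.0"


def suggested_citation(info: DatasetInfo) -> str:
    """Return a citation string: Creators (Year). Title (Version). Repository. Identifier."""
    creators = "; ".join(info.creators)
    return (
        f"{creators} ({info.year}). {info.title} "
        f"(Version {info.version}) [Data set]. "
        f"{info.repository}. {info.identifier}"
    )


def build_readme(info: DatasetInfo, files: Dict[str, str]) -> str:
    """Return readme text with a title, a file listing, and a suggested citation."""
    lines = [info.title, "", "Files"]
    for name, description in files.items():
        lines.append(f"  {name}: {description}")
    lines += ["", "How to cite", f"  {suggested_citation(info)}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = DatasetInfo(
        title="Example Survey Data",
        creators=["Doe, J.", "Roe, R."],
        year=2024,
        repository="Example Repository",
        identifier="doi:10.0000/example",
    )
    print(build_readme(example, {"survey.csv": "De-identified survey responses"}))
```

Keeping the citation in the readme means the information travels with the files even when they are copied out of the repository.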

Thoughts About Posting Data

  1. While posting data on a web page can increase visibility and be helpful for presentation, it is not recommended as a strategy for data sharing.  Instead, deposit your data into a trusted repository that issues DOIs (digital object identifiers) and refer to your data by DOI when citing on web pages or in other media.
  2. Posting code on GitHub is an accepted way of sharing code. To enhance code citability and ensure the exact version of the code accompanies the associated data, deposit your code in a data repository.  Many data repositories support GitHub integration.

Archiving

Archiving ensures that data are properly selected, stored, and accessible.  It maintains the data’s logical and physical integrity, including security and authenticity, over time.  

Archiving data and code keeps them clean, documented, organized, and as self-contained as possible. Data can be archived in various ways:

  1. On local hard disk drives
  2. On tape for long-term storage
  3. In the cloud
  4. In dedicated data repositories

A rule of thumb when archiving a project is to document and organize materials such that a colleague could understand the archival bundle without explanation.
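
One common way to help a colleague (or your future self) confirm that an archival bundle is complete and unchanged is to store a checksum manifest alongside the files. The Python sketch below is a minimal illustration using only the standard library; the function names and the choice of SHA-256 are assumptions for this example, not a WIDRR requirement.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(root).as_posix()] = sha256_of(path)
    return manifest


def changed_files(root: Path, manifest: Dict[str, str]) -> List[str]:
    """List manifest entries that are missing or whose checksum no longer matches."""
    problems = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(relative_path)
    return problems
```

Build the manifest before archiving, save it outside the bundle directory (or exclude it when verifying), and run `changed_files` after retrieving the bundle to detect missing or altered files.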

Placing data in a dedicated repository is the preferred method of archiving publicly releasable datasets and code (e.g., data associated with published articles) since it allows for data reuse and citation and enables research reproducibility.  Check out the Data Repositories page to get started.  

Using a data repository significantly reduces the risk of data loss. Data repositories make explicit commitments about storage and how long data will remain available. It is still advisable to retain an offline copy of the data, which the Weill Cornell Institutional Data Repository for Research (WIDRR) makes possible.

Data Retention

Weill Cornell faculty are responsible for retaining all relevant research data by entering the required metadata, data, and a methods description into WIDRR within three years of the final project closeout, for both funded and unfunded research.

  1. Faculty are strongly encouraged to deposit their research data in an approved public data repository when not restricted by reasons of data privacy or confidentiality.
  2. Faculty are required by Cornell University Policy to follow this public deposit with a deposit to WIDRR of metadata, a methods description, and a link to the public deposit, after publication or within three years of the final project closeout of either funded or unfunded research. 

Faculty who leave WCM must still make the WIDRR deposit described in item 2 above.

For more information, see the Samuel J. Wood Library Data Preservation, Access, and Associated Timelines site.

Confidentiality

It is vital to maintain the confidentiality of research subjects, both for ethical reasons and to ensure continued participation in research. Sometimes, research data resulting from funded research cannot be shared. Specific regulations, such as the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA), address this situation.

Researchers who want to ethically share sensitive and confidential data may want to consider the following:

  1. Include a provision for data sharing when obtaining informed consent from research participants.  Language for consent forms with a provision for data sharing is available.  
  2. Evaluate the sensitivity of your data by considering whether they contain direct identifiers, or indirect identifiers that could be combined with other public information to identify research participants.
  3. Obtain a confidentiality review. Some data archives, such as the Inter-university Consortium for Political and Social Research (ICPSR), will review your data for the presence of confidential information.
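
As a first pass at item 2, a simple script can flag column names that look like direct or indirect identifiers before a person reviews the data. The Python sketch below is only an illustration: the keyword lists are hypothetical and far from exhaustive, the check looks at column names rather than cell contents, and it does not replace a confidentiality review or any HIPAA determination.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical keyword lists for illustration; adapt them to your own data.
DIRECT_TERMS = {"name", "email", "phone", "ssn", "address", "mrn"}
INDIRECT_TERMS = {"zip", "dob", "birthdate", "age", "sex", "gender", "race"}


def flag_identifier_columns(columns: Iterable[str]) -> Dict[str, List[str]]:
    """Group column names by whether they contain a direct or indirect identifier keyword."""
    flagged: Dict[str, List[str]] = {"direct": [], "indirect": []}
    for column in columns:
        # Split names such as "patient_name" or "Zip-Code" into lowercase tokens.
        tokens = set(re.split(r"[^a-z0-9]+", column.lower()))
        if tokens & DIRECT_TERMS:
            flagged["direct"].append(column)
        elif tokens & INDIRECT_TERMS:
            flagged["indirect"].append(column)
    return flagged


if __name__ == "__main__":
    header = ["patient_name", "zip_code", "visit_count", "Email"]
    print(flag_identifier_columns(header))
```

Columns that are not flagged can still be identifying, so treat the output as a starting point for manual review rather than a clearance.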