Data Type

NIH Expectations

Briefly describe the scientific data to be managed and shared: 

  • Summarize the types (for example, 256-channel EEG data and fMRI images) and amount (for example, from 50 research participants) of scientific data to be generated and/or used in the research. Descriptions may include the data modality (e.g., imaging, genomic, mobile, survey), level of aggregation (e.g., individual, aggregated, summarized), and/or the degree of data processing. 
  • Describe which scientific data from the project will be preserved and shared. NIH does not anticipate that researchers will preserve and share all scientific data generated in a study. Researchers should decide which scientific data to preserve and share based on ethical, legal, and technical factors. The plan should provide the reasoning for these decisions. 
  • A brief listing of the metadata, other relevant data, and any associated documentation (e.g., study protocols and data collection instruments) that will be made accessible to facilitate interpretation of the scientific data 

Adapted from: Writing a Data Management & Sharing Plan

Why is this being asked?

Data without context loses its power and objectivity. By comprehensively describing your data, you are ensuring that the complete picture of your research is communicated, and that any derivative work resulting from your research remains academically honest. Additionally, by being asked to think about how your data will be generated, described, and structured before any data is collected, you are indicating a commitment to robust research and data practices. You will also be saving yourself time and effort, as well as avoiding headaches, by knowing exactly what data you are generating, where it is, and how to access and use it.

What to Include

Data Source and Type

  • Indicate where the data is being generated or pulled from. Are you collecting data from an instrument, survey, or electronic health record? Will you be aggregating multiple datasets together, or is your data the result of only one set of observations? 
  • Indicate what types of data are being generated. Is the result of your research a spreadsheet? Are you capturing images or video? 
  • Indicate which data will be shared, if any. Not all data generated as a result of your project need to be shared as part of your DMSP. The following types of data are not required to be shared per the NIH DMS Policy:
    • Data that are not necessary for or of sufficient quality to validate and replicate the research findings
    • Laboratory notebooks
    • Preliminary analyses that are not necessary for or of sufficient quality to validate and replicate the research findings
    • Completed case report forms
    • Drafts of scientific papers
    • Plans for future research
    • Peer reviews
    • Communications with colleagues
    • Physical objects, such as laboratory specimens

Level of Data Processing

  • Indicate the level of data processing/aggregation. There are several phases of data collection and analysis, and you have some say over what, if anything, is made available. Below are some options when it comes to what you can share:
    • Raw: as collected from data source 
    • Processed: cleaned and organized for analysis, de-identified if applicable 
    • Summarized: data used to generate figures and tables 

Restrictions on Data

  • State any restrictions on data based on Protected Health Information (PHI), IRB approval, data use agreements, or any other justifiable reason. If the data cannot be de-identified and still be usable, indicate that in this section. You will be able to go into more detail in later sections. 

Amount of Data

  • State the number of observations generated or used, even if using data from a public dataset. If the size of the dataset or the number of files in it is significant, include that here. 

File Formats

  • When possible, choose open, non-proprietary formats. These formats allow anyone to open and view your data without depending on specific licensed software. Common preferred file formats are listed below, and an exhaustive list is maintained by The National Archives. Your data may not fit any of these formats, but make every effort to find an open format that works for you. If that is not possible, indicate it here.
    • Images: .tiff, .png, .jpg, .bmp 
    • Text: .txt, .pdf 
    • Tabular data: .csv, .pdf 
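
As a rough illustration, the Python sketch below flags files whose extensions fall outside the short list above. The function name find_non_preferred and the sample file names are hypothetical, and the extension set only mirrors the formats named in this guide, not the full National Archives list.

```python
from pathlib import Path

# Extensions copied from the short list in this guide (an assumption:
# extend this set with the formats accepted in your own field).
PREFERRED_EXTENSIONS = {".tiff", ".png", ".jpg", ".bmp", ".txt", ".pdf", ".csv"}


def find_non_preferred(filenames):
    """Return the file names whose extension is not in the preferred set."""
    return [
        name
        for name in filenames
        if Path(name).suffix.lower() not in PREFERRED_EXTENSIONS
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the check.
    sample = ["results.csv", "scan_01.TIFF", "analysis.xlsx", "notes.docx"]
    print(find_non_preferred(sample))
```

Any file the check reports is a candidate for conversion before sharing, or for a note in this section explaining why an open format is not possible.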

File Naming

  • Indicate the naming convention for the files you are sharing, or indicate where it can be found, in this section. A file naming convention makes it easier for others to find and use your data, and having a robust naming plan in place before conducting your research will help you stay organized and on task. Below are some helpful tips when deciding how you will name your files:
    • Check for field-specific standards.  
    • For dates use YYYYMMDD; for datetimes use YYYYMMDDThhmm (24-hour time) 
    • Do not include spaces; use ‘-’ or ‘_’ as separators if necessary 
    • Use versioning; file_v1.csv or file_v01.csv 
    • Include a README file (see below) to explain naming conventions and any abbreviations 
    • Example: 20220922_NHDS_export_v01.csv 
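
The tips above can be sketched in a few lines of Python. This is only one possible reading of the convention: the function names build_filename and follows_convention are hypothetical, and the pattern assumes a YYYYMMDD date, underscore-separated parts, and a two-digit version, as in the example above.

```python
import re
from datetime import date

# Assumed pattern: YYYYMMDD, one or more name parts joined by '_' or '-',
# a two-digit version such as v01, then a lowercase extension.
FILENAME_PATTERN = re.compile(
    r"^\d{8}_[A-Za-z0-9]+(?:[_-][A-Za-z0-9]+)*_v\d{2}\.[a-z0-9]+$"
)


def build_filename(day, source, description, version, extension):
    """Assemble a name such as 20220922_NHDS_export_v01.csv."""
    return f"{day:%Y%m%d}_{source}_{description}_v{version:02d}.{extension}"


def follows_convention(filename):
    """Return True if the file name matches the assumed pattern."""
    return FILENAME_PATTERN.match(filename) is not None


if __name__ == "__main__":
    name = build_filename(date(2022, 9, 22), "NHDS", "export", 1, "csv")
    print(name, follows_convention(name))
    # A name with spaces and dashes in the date does not match.
    print(follows_convention("2022-09-22 NHDS export.csv"))
```

Generating names from one function, rather than typing them by hand, keeps dates and version numbers consistent across a project. Adjust the pattern if your field has its own standard.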

Documentation

  • All good documentation begins with a README file. In general, this is a detailed listing of data formats, structures, and naming conventions. Indicate in this section whether you will be including a README file; preparing one can save you time and effort when constructing your DMSP. Cornell University has a guide on how to construct a quality README and Arizona provides a template, but in general you will want to include the following at a minimum: 
    • Contact information 
    • File structure, including naming conventions and versioning nomenclature
    • File formats for each data type 
    • Codes (if applicable)
  • If applicable, any data collection instruments, such as surveys or extraction tools, or review protocols should be indicated in this section. 

Sample Responses 

  • This project will produce computer code (G-code, as txt), ink formulations (pdf), tabular data (csv), images (png), and physical artifacts for approximately 500 specimens. Tabular data will be collected from mechanical strength testing and spectrometry and will be saved as their raw output, with the R code used in their analysis also made available. Image data will primarily be histological, with both raw and stained images made available. A single physical artifact for each scaffold configuration will be cataloged and made available upon request. All lab work done to generate the inks will be documented in LabArchives and made available as a pdf. It is expected that each specimen will result in approximately 50 additional pieces of data, which will all be linked by an identifier unique to each specimen. Details will be included in a README file which will be made publicly available at time of publication. All metadata and other relevant data will be made available. 
  • We will be using 10,000 data points from the National Hospital Care Survey (NHCS). The summarized data used for analysis will be made available, along with the R code used to process the initial NHCS data export. Details will be included in a README file which will be made publicly available at time of publication. All metadata and other relevant data will be made available. 
  • This project will capture demographic and clinical data from 100 human participants (50 experimental/50 control). This study is approved by the Weill Cornell Medicine Institutional Review Board (Protocol 98734768), and the approved informed consent form does not allow for sharing of data collected during the study. As a result, data will not be shared at the conclusion of the study. However, per the WCM Data Retention Policy, all data and supporting documentation will be archived in the WCM Institutional Data Repository for Research.