Preparing and sharing a complete data package is essential to meeting HEAL Initiative data sharing requirements and supporting FAIR (Findable, Accessible, Interoperable, and Reusable) research.
This resource provides guidance and supporting materials to help investigators assemble a well-organized, HEAL-compliant data package that promotes transparency and long-term data reuse. It is organized into short, expandable sections describing what a data package includes and which HEAL-specific requirements apply. Investigators can also explore HEAL-funded project examples organized by data type.
A data package consists of data and other files that make the data interpretable and reusable, providing additional context around what the study measured and how the data was collected. It typically includes data files, documentation (such as protocols and data collection instruments), metadata (such as codebooks), and code that documents data collection, organization, and analysis. HEAL studies submit their data packages to HEAL-compliant data repositories for long-term preservation.
As part of the HEAL Initiative, data packages must meet specific sharing requirements:
Consult the Checklist for HEAL-Compliant Data and Checklist Tracker Tool to track study progress through HEAL-required data sharing steps. Note: an investigator’s chosen HEAL-compliant repository may have additional repository-specific data packaging requirements.
The HEAL Initiative expects funded studies to “make their HEAL data Findable, Accessible, Interoperable, and Reusable (FAIR) in line with the HEAL Data Sharing Policy and broader efforts across NIH, as outlined in the NIH Strategic Plan for Data Science.” Sharing a FAIR data package enables others to understand, verify, and build upon a team’s work, expanding its impact across the research community. It also allows researchers to combine one study’s findings with other studies or reanalyze the data to generate new insights. By promoting transparency and facilitating reuse, data sharing contributes to scientific progress and innovation.
To learn more about sharing data, please refer to the following HEAL FreshFair Webinars:
At a minimum, a data package should include the core file components: data files, a README file, and variable-level metadata files, such as data dictionaries. Additional files that strengthen transparency and support long-term reuse include study protocols, blank data collection instruments, code or scripts, and publication citations that link study data to related research. Together, these materials create a clear and complete record of a team’s study, which helps others interpret their work, reproduce results, and build on their findings.
Each data package should include the following core components:
.csv or .tsv, so others can open them easily.
Note: Repository-specific documentation
A selected HEAL-compliant repository may have additional data package requirements. Tip: navigate to the HEAL-compliant Repository Selection Guide, find the selected repository in the table, and click on “Guidance” under the “links” column to find additional information on that repository.
Sharing other files in the data package can increase understandability and reusability. Consider sharing:
A clear file-naming convention and logical folder structure help ensure consistency, reproducibility, and ease of reuse. Files can be grouped by function, with separate folders for raw, standardized, and processed data, and distinct locations for scripts, metadata, and documentation. For example, the ICPSR Guide to Social Science Data Preparation and Archiving (6th ed.) and MIT Libraries’ Organize your files offer practical recommendations on folder hierarchies and data organization that apply broadly across disciplines.
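As one illustration of grouping files by function, the following minimal Python sketch creates an empty folder layout for a data package. The folder names and the create_package_skeleton helper are hypothetical, chosen only for this example; they are not prescribed by HEAL, ICPSR, or MIT Libraries, and a selected repository may expect a different layout.

```python
from pathlib import Path

# Hypothetical folder names, chosen for illustration only; adapt them to
# the study and to any repository-specific requirements.
PACKAGE_FOLDERS = [
    "data/raw",
    "data/standardized",
    "data/processed",
    "scripts",
    "metadata",
    "documentation",
]


def create_package_skeleton(root):
    """Create the empty folder layout under `root` and return the created paths."""
    root = Path(root)
    created = []
    for folder in PACKAGE_FOLDERS:
        path = root / folder
        # exist_ok=True makes the function safe to re-run on an existing package.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_package_skeleton("my_study_package") would create the six folders under that directory. Keeping raw data in its own folder makes it easier to show that later files were derived from it rather than edited in place.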
Data should be shared in open, non-proprietary formats such as .csv, .tsv, .txt, .json, or .tiff, in keeping with NIH recommendations and the GREI Checklist for Data Submission in Generalist Repositories, which outlines best practices for ensuring file accessibility and interoperability. Following consistent structures allows others (and the original study team, in the future) to interpret and reuse the data with minimal effort.
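A quick way to spot files that may need conversion before sharing is to scan the data folder for extensions outside the open formats named above. The sketch below is a rough check under stated assumptions: the OPEN_EXTENSIONS set only mirrors the examples in this section and is not an official or exhaustive list, and find_non_open_files is a hypothetical helper name.

```python
from pathlib import Path

# Extensions taken from the examples in this guidance; treat the set as a
# starting point, not as an official or exhaustive list of accepted formats.
OPEN_EXTENSIONS = {".csv", ".tsv", ".txt", ".json", ".tiff"}


def find_non_open_files(root):
    """Return files under `root` whose extension is not in OPEN_EXTENSIONS."""
    flagged = []
    for path in sorted(Path(root).rglob("*")):
        # Compare case-insensitively so that "DATA.CSV" is not flagged.
        if path.is_file() and path.suffix.lower() not in OPEN_EXTENSIONS:
            flagged.append(path)
    return flagged
```

The function is meant to be pointed at the data folder only; scripts and documentation files legitimately use other extensions. A flagged file is a prompt for review, not proof that the format is unacceptable.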
Every HEAL data package must include sufficient documentation to make the data interpretable and reusable. A README or summary file should describe the study, data organization, file structure, and relevant software versions. The Cornell Data Management Guide provides practical instructions for creating README files that document dataset context, file organization, and processing details. The Digital Curation Centre’s Metadata Standards Directory provides links to community metadata frameworks that can help determine what descriptive details to include.
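To show how the README elements listed above fit together, the following sketch assembles a plain-text README from a study description, the package's file listing, and a set of software versions. The section headings, the build_readme helper, and its parameters are assumptions made for this example; they do not reproduce the Cornell template or any HEAL-required layout.

```python
from pathlib import Path


def build_readme(title, description, package_root, software_versions):
    """Return plain-text README content for the package at `package_root`."""
    root = Path(package_root)
    lines = [title, "", "Study description", description, "", "File structure"]
    # List every file relative to the package root, in a stable order.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            lines.append(f"  {path.relative_to(root).as_posix()}")
    lines += ["", "Software versions"]
    for name, version in sorted(software_versions.items()):
        lines.append(f"  {name} {version}")
    return "\n".join(lines) + "\n"
```

The generated text is only a starting point. Study context, data organization decisions, and processing details still need to be written by the study team.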
Accompany data with variable-level metadata (VLMD).
For tabular data, this is most likely a data dictionary or codebook, which defines each variable, lists valid values, and specifies measurement units. To align with HEAL’s metadata standards, reference the HEAL Variable-Level Metadata (VLMD) Schema available on GitHub. This schema describes the structure and required fields for HEAL variable-level metadata and is complemented by examples of valid and invalid VLMD files. Investigators requiring additional context may reference the Basic Data Dictionary template available through the Open Science Framework (OSF).
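A first draft of a data dictionary can be generated from a tabular file and then completed by hand. The sketch below reads CSV text and produces one row per variable, with the values observed for low-cardinality columns. The field names used here (name, description, units, observed_values) and the draft_data_dictionary helper are hypothetical placeholders; they are not the field names of the HEAL VLMD Schema, which should be consulted for the actual required fields.

```python
import csv
import io


def draft_data_dictionary(csv_text, max_values=10):
    """Return one draft dictionary row per column of the CSV in `csv_text`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    observed = {name: set() for name in columns}
    for row in reader:
        for name in columns:
            value = row[name]
            # Skip empty cells and cells missing from short rows.
            if value not in (None, ""):
                observed[name].add(value)
    rows = []
    for name in columns:
        values = sorted(observed[name])
        rows.append({
            "name": name,
            "description": "",  # to be written by the study team
            "units": "",        # to be written by the study team
            # Only list values for low-cardinality (likely categorical) columns.
            "observed_values": "|".join(values) if len(values) <= max_values else "",
        })
    return rows
```

Observed values are not the same as valid values: a category that no participant selected will not appear in the data, so the draft should be checked against the data collection instrument.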
For non-tabular data, VLMD may be embedded in the files or provided in related documentation, such as a README or protocol. Because formats vary by data type, investigators should contact the HEAL Stewards with questions about which VLMD formats are appropriate for their data.
Where applicable, include study-level documentation such as protocols, analysis code, or Standard Operating Procedures (SOPs) describing data collection, processing, and analytic methods.
When working with human subjects data, remove or obscure identifiable information prior to sharing following institutional and NIH guidelines. This includes removing direct identifiers (e.g., names, dates of birth, exact dates) and transforming indirect identifiers (e.g., location). If de-identification does not sufficiently reduce disclosure risk, controlled-access repositories can offer additional protection.
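The distinction between removing direct identifiers and transforming indirect identifiers can be sketched as follows. The column names (name, date_of_birth, visit_date, zip_code) and the deidentify_row helper are hypothetical, and the coarsening rules shown (year only, first three ZIP digits) are illustrative choices that may not be sufficient for a given dataset. Institutional and NIH guidelines, not this example, determine what is adequate.

```python
from datetime import date

# Hypothetical column names for illustration; real datasets will differ.
DIRECT_IDENTIFIERS = {"name", "date_of_birth"}


def deidentify_row(row):
    """Return a copy of `row` with direct identifiers dropped and indirect ones coarsened."""
    cleaned = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
    # Replace an exact ISO-format visit date with the year only.
    if "visit_date" in cleaned:
        cleaned["visit_year"] = str(date.fromisoformat(cleaned.pop("visit_date")).year)
    # Keep only the first three digits of a ZIP code.
    if "zip_code" in cleaned:
        cleaned["zip3"] = cleaned.pop("zip_code")[:3]
    return cleaned
```

The function returns a new dictionary and leaves the input row unchanged, so the original records stay intact in a secure location while the transformed copy goes into the shared package.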
HEAL’s Sensitive Data guidance explains how to identify and prepare sensitive or restricted data for sharing, while protecting participant privacy and aligning with HEAL and NIH data sharing expectations. The HEAL Sensitive Data Decision Tool provides step-by-step guidance for determining whether additional or specialized safeguards are necessary for sensitive or potentially identifiable data. In situations where full de-identification is not feasible, make decisions about what to share collaboratively with the research team and the Institutional Review Board (IRB) to uphold participant privacy and ethical standards.
Manage and share data in accordance with the 2023 NIH Data Management and Sharing Policy. Researchers should document data collection and processing steps, apply community standards and metadata, and organize files in open, interoperable formats.
Documenting all data processing and analysis steps ensures datasets remain transparent, traceable, and reusable. Include the code or scripts used to transform raw data into analysis-ready formats in the package, as recommended in the GREI Checklist and the ICPSR Guide, both of which highlight the importance of capturing provenance and transformation details. Maintaining this level of documentation supports FAIR principles, allowing others to reproduce analyses, verify results, and adapt workflows for future research. Including version information for software, packages, and analysis pipelines further enhances traceability and integrity.
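For analyses run in Python, version information can be captured programmatically and saved alongside the scripts. The sketch below uses the standard library's importlib.metadata to look up installed package versions; the collect_versions helper and the choice to record "not installed" for missing packages are assumptions made for this example.

```python
import platform
from importlib import metadata


def collect_versions(package_names):
    """Return a dict mapping 'python' and each package name to a version string."""
    versions = {"python": platform.python_version()}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of failing.
            versions[name] = "not installed"
    return versions
```

For example, collect_versions(["pandas", "numpy"]) returns the Python version plus the installed version of each named package, which can then be written into the README or a separate environment file.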
