Data Packaging Guidance

Preparing and sharing a complete data package is essential to meeting HEAL Initiative data sharing requirements and supporting FAIR (Findable, Accessible, Interoperable, and Reusable) research.

Data Packaging Examples and Best Practices

Overview

This resource provides guidance and supporting materials to help investigators assemble a well-organized, HEAL-compliant data package that promotes transparency and long-term data reuse. It is organized into short, expandable sections describing what a data package includes and what HEAL-specific requirements apply. Investigators can also explore HEAL-funded project examples organized by data type.

Understanding the Data Package

What is a data package?

A data package consists of data and other files that make the data interpretable and reusable, providing additional context around what the study measured and how the data were collected. It typically includes data files, documentation (such as protocols and data collection instruments), metadata (such as codebooks), and code that documents data collection, organization, and analysis. HEAL studies submit their data packages to HEAL-compliant data repositories for long-term preservation.

What are HEAL’s expectations for data packages?

As part of the HEAL Initiative, data packages must meet specific sharing requirements:

De-identify identifiable human subject data prior to sharing
Include variable-level metadata (VLMD), describing the shared data file(s)
Provide links to publications generated using study data
Submit all underlying primary data, making it broadly available in an appropriate data repository

Consult the Checklist for HEAL-Compliant Data and Checklist Tracker Tool to track study progress through HEAL-required data sharing steps. Note: an investigator’s chosen HEAL-compliant repository may have additional repository-specific data packaging requirements.

Why Share Your Data Package?

The HEAL Initiative expects funded studies to “make their HEAL data Findable, Accessible, Interoperable, and Reusable (FAIR) in line with the HEAL Data Sharing Policy and broader efforts across NIH, as outlined in the NIH Strategic Plan for Data Science.” Sharing a FAIR data package enables others to understand, verify, and build upon a team’s work, expanding its impact across the research community. It also allows researchers to combine one study’s findings with other studies or reanalyze the data to generate new insights. By promoting transparency and facilitating reuse, data sharing contributes to scientific progress and innovation.

To learn more about sharing data, please refer to the following HEAL FreshFair Webinars:

When & What to Share: Understanding HEAL Data Sharing Requirements (YouTube)
Getting Ready to Share Your Data (YouTube)

Components of a Data Package

Overview

At a minimum, a data package should include the core file components: data files, a README file, and variable-level metadata files, such as data dictionaries. Additional files that strengthen transparency and support long-term reuse include study protocols, blank data collection instruments, code or scripts, and publication citations that link study data to related research. Together, these materials create a clear and complete record of a team’s study, which helps others interpret their work, reproduce results, and build on their findings.

Core Data Package Components

Each data package should include the following core components:

Data file(s): These might be raw or processed data. When possible, share files in a non-proprietary file format, like .csv or .tsv, so others can open them easily.
A summary or README file: A README, or summary file, describes the context and characteristics of the data. It may describe experimental and data acquisition details, such as how data subjects were sampled, study start and end dates, and the geographic locations of data collection activities. A README may also describe the study or study files, such as the data package file directory structure, any accompanying code documentation, and a list of software and versions used for data management and analysis. It may also include investigator names, institutes involved, funding sources, and version identifiers. Click here for additional guidance and templates.
Variable-level metadata (VLMD) documentation (often a data dictionary or codebook): VLMD helps others understand a study’s data file variables by listing and defining each variable, stating units of measurement, and noting coding schemes. Each data file should have a VLMD file. A VLMD file for tabular data (data organized by rows and columns) is called a data dictionary or a codebook. HEAL Stewards have developed HEAL-specific data dictionary preparation guidance, and you can click here for general guidance on data dictionaries. A VLMD file template for HEAL Data Platform submissions and VLMD schemas can be found in the GitHub repository, and additional documentation on submitting VLMD is available here. Studies with non-tabular data should click here if they would like to schedule a consultation with the HEAL Data Stewards regarding suitable VLMD files for their data type.

Note: Repository-specific documentation
A selected HEAL-compliant repository may have additional data package requirements. Tip: navigate to the HEAL-compliant Repository Selection Guide, find the selected repository in the table, and click on “Guidance” under the “links” column to find additional information on that repository.

Additional Data Package Components

Sharing other files in the data package can increase understandability and reusability. Consider sharing:

Code used to transform raw data into analytic datasets: Include code/scripts/workflows documenting how raw data were cleaned, processed, or combined into analytic datasets, so others can understand and reproduce the study’s steps.
Code used to conduct analyses: Provide analysis scripts or notebooks, allowing others to verify results, reproduce findings, or adapt methods for their own research.
Publication citation(s): Linking data and publications helps others understand how the data were used, validate interpretations, and cite work accurately. Some repositories include a metadata field for citations, where investigators can list the digital object identifiers (DOIs) of related publications. It is also common to list related publication DOIs and/or web addresses in README files.
Study protocol(s): Protocols clarify how and why data were collected and provide information about the study design and methods. They support reproducibility, reduce misinterpretation, and guide others in designing compatible or follow-up studies.
Standard Operating Procedures (SOPs): SOPs document data generation and processing in more detail.
Blank data collection instruments: Include blank instruments, such as questionnaires, surveys, or forms. This helps others understand variable definitions, data structure, and question wording, which are critical for secondary analysis, harmonization, or comparison across studies.
Context or explanatory documents: Include write-ups explaining data quality issues, opaque data processing decisions, or missing information and files. These contextual details prevent misuse or misinterpretation and help secondary users assess data limitations. For example, some data packages include a one-page text file describing how to interpret missing values for certain variables. Others may include a brief description of steps taken to de-identify data.

General Guidance and Resources

Organizing Files and Folders

A clear file-naming convention and logical folder structure help ensure consistency, reproducibility, and ease of reuse. Files can be grouped by function, with separate folders for raw, standardized, and processed data, and distinct locations for scripts, metadata, and documentation. For example, the ICPSR Guide to Social Science Data Preparation and Archiving (6th ed.) and MIT Libraries’ Organize your files offer practical recommendations on folder hierarchies and data organization that apply broadly across disciplines.
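As one way to picture the grouping described above, the following minimal sketch creates an empty folder skeleton for a data package. The folder names and the `create_package_skeleton` helper are illustrative assumptions, not a HEAL-required layout; a chosen repository may expect a different structure.

```python
from pathlib import Path

# Hypothetical folder names chosen for illustration only;
# adapt them to the study and to any repository-specific requirements.
PACKAGE_FOLDERS = [
    "data/raw",
    "data/standardized",
    "data/processed",
    "code",
    "metadata",
    "docs",
]


def create_package_skeleton(root):
    """Create the folder tree under `root` and return the created paths."""
    root = Path(root)
    created = []
    for folder in PACKAGE_FOLDERS:
        path = root / folder
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_package_skeleton("my_study_package")` would create the six folders under a directory of that name, leaving any existing files untouched.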

Data should be shared in open, non-proprietary formats such as .csv, .tsv, .txt, .json, or .tiff, in keeping with NIH recommendations and the GREI Checklist for Data Submission in Generalist Repositories, which outlines best practices for ensuring file accessibility and interoperability. Following consistent structures allows others (and the original study team, in the future) to interpret and reuse the data with minimal effort.
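A quick check before submission can flag data files that are not in one of the open formats listed above. The sketch below is an assumption-laden illustration: the `find_non_open_files` helper is hypothetical, and the extension set simply mirrors the examples in the text rather than an official or exhaustive list.

```python
from pathlib import Path

# Extensions taken from the examples in the text; not an exhaustive list.
OPEN_EXTENSIONS = {".csv", ".tsv", ".txt", ".json", ".tiff"}


def find_non_open_files(data_dir, allowed=OPEN_EXTENSIONS):
    """Return sorted relative paths of files in `data_dir` whose extension is not in `allowed`."""
    root = Path(data_dir)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() not in allowed
    )
```

Pointing the function at the data folder only (rather than the whole package) avoids flagging scripts and documentation, which legitimately use other extensions.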

Documentation and Metadata

Every HEAL data package must include sufficient documentation to make the data interpretable and reusable. A README or summary file should describe the study, data organization, file structure, and relevant software versions. The Cornell Data Management Guide provides practical instructions for creating README files that document dataset context, file organization, and processing details. The Digital Curation Centre’s Metadata Standards Directory provides links to community metadata frameworks that can help determine what descriptive details to include.
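To make the README contents above concrete, here is a minimal sketch that renders a plain-text README skeleton. The `render_readme` function, its parameters, and the section labels are hypothetical choices for illustration; they are not a HEAL or Cornell template, and the TODO lines are meant to be filled in by the study team.

```python
def render_readme(study_title, investigators, funding_sources, software):
    """Return README text. `software` maps a tool name to the version used."""
    lines = [
        f"Study title: {study_title}",
        "Investigators: " + ", ".join(investigators),
        "Funding sources: " + ", ".join(funding_sources),
        "",
        "Study description: TODO",
        "Data collection dates and locations: TODO",
        "File and folder structure: TODO",
        "",
        "Software and versions:",
    ]
    # One indented line per tool so versions are easy to scan.
    lines += [f"  {name} {version}" for name, version in software.items()]
    return "\n".join(lines) + "\n"
```

The returned string can be written to a `README.txt` file at the top level of the package.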

Accompany data with variable-level metadata (VLMD).

For tabular data, this is most likely a data dictionary or codebook, which defines each variable, lists valid values, and specifies measurement units. To align with HEAL’s metadata standards, reference the HEAL Variable-Level Metadata (VLMD) Schema available on GitHub. This schema describes the structure and required fields for HEAL variable-level metadata and is complemented by examples of valid and invalid VLMD files. Investigators requiring additional context may refer to the HEAL Data Dictionary Preparation Guidance or a basic Data Dictionary template, which is available through the Open Science Framework (OSF).
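A data dictionary for a tabular file can be started by listing the file's column names and leaving the descriptive fields for the study team to complete. The sketch below assumes a CSV data file with a header row; the `draft_data_dictionary` helper and the output column names (`name`, `description`, `type`, `units`, `valid_values`) are illustrative assumptions and are not the HEAL VLMD schema, which should be consulted for the actual required fields.

```python
import csv

# Illustrative columns only; check the HEAL VLMD schema for the real field names.
DICTIONARY_FIELDS = ["name", "description", "type", "units", "valid_values"]


def draft_data_dictionary(data_csv_path, dictionary_csv_path):
    """Write a blank data dictionary with one row per variable; return the variable names."""
    with open(data_csv_path, newline="", encoding="utf-8") as f:
        # The first row of the data file is assumed to hold the variable names.
        header = next(csv.reader(f), [])

    with open(dictionary_csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=DICTIONARY_FIELDS)
        writer.writeheader()
        for name in header:
            # Unspecified fields are written as empty strings for manual completion.
            writer.writerow({"name": name})
    return header
```

The resulting file is only a starting point: definitions, units, and valid values still need to be written by someone who knows the data.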

For non-tabular data, VLMD may be embedded in the files or provided in related documentation, such as a README or protocol. Because formats vary by data type, investigators should contact the HEAL Stewards with questions about which VLMD formats are appropriate for their data.

Where applicable, include study-level documentation such as protocols, analysis code, or Standard Operating Procedures (SOPs) describing data collection, processing, and analytic methods.

Ethical and Privacy Considerations

When working with human subjects data, remove or obscure identifiable information prior to sharing, following institutional and NIH guidelines. This includes removing direct identifiers (e.g., names, dates of birth, exact dates) and transforming indirect identifiers (e.g., location). If de-identification does not sufficiently reduce disclosure risk, controlled-access repositories can offer additional protection.
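The two operations described above, removing direct identifiers and transforming indirect ones, can be sketched as follows. This is an illustration only and not a sufficient de-identification procedure: the column names (`name`, `date_of_birth`, `zip_code`) and the `deidentify_rows` helper are hypothetical, and truncating a postal code is just one example of coarsening a location. Actual de-identification decisions should follow institutional and NIH guidance.

```python
# Hypothetical column names for illustration.
DIRECT_IDENTIFIERS = {"name", "date_of_birth"}


def deidentify_rows(rows, direct_identifiers=DIRECT_IDENTIFIERS, zip_column="zip_code"):
    """Return copies of `rows` (dicts) without direct identifiers and with a coarsened location."""
    cleaned = []
    for row in rows:
        # Remove direct identifiers entirely.
        new_row = {k: v for k, v in row.items() if k not in direct_identifiers}
        # Transform an indirect identifier: keep only the first three characters.
        if new_row.get(zip_column):
            new_row[zip_column] = new_row[zip_column][:3]
        cleaned.append(new_row)
    return cleaned
```

The input rows are not modified, so the original records remain available to the study team under their usual access controls.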

HEAL’s Sensitive Data guidance explains how to identify and prepare sensitive or restricted data for sharing, while protecting participant privacy and aligning with HEAL and NIH data sharing expectations. The HEAL Sensitive Data Decision Tool provides step-by-step guidance for determining whether additional or specialized safeguards are necessary for sensitive or potentially identifiable data. In situations where full de-identification is not feasible, make decisions about what to share collaboratively with the research team and the Institutional Review Board (IRB) to uphold participant privacy and ethical standards.

Documenting Analyses

Manage and share data in accordance with the 2023 NIH Data Management and Sharing Policy. Researchers should document data collection and processing steps, apply community standards and metadata, and organize files in open, interoperable formats.

Documenting all data processing and analysis steps ensures datasets remain transparent, traceable, and reusable. Include the code or scripts used to transform raw data into analysis-ready formats in the package, as recommended in the GREI Checklist and the ICPSR Guide, both of which highlight the importance of capturing provenance and transformation details. Maintaining this level of documentation supports FAIR principles, allowing others to reproduce analyses, verify results, and adapt workflows for future research. Including version information for software, packages, and analysis pipelines further enhances traceability and integrity.
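Version information can be captured automatically at analysis time rather than reconstructed later. The sketch below, which assumes a Python-based workflow, reports the interpreter version and the installed versions of named packages; the `describe_environment` helper is a hypothetical name, and teams using R or other tools would use the equivalent facilities there.

```python
import platform
from importlib import metadata


def describe_environment(package_names):
    """Return a text summary of the Python version and the versions of the given packages."""
    lines = [f"Python {platform.python_version()} on {platform.system()}"]
    for name in package_names:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)
```

The returned text can be pasted into the README or saved alongside the analysis code so that the versions travel with the package.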