Data Packaging Examples and Best Practices

These examples are from HEAL-funded studies that have submitted data to a HEAL-compliant repository. The datasets are publicly accessible, and the Principal Investigators have given the HEAL Stewards permission to link to their data packages. Some data types below do not have data package examples available yet. Examples will be added as they become available. In the meantime, general data sharing guidance materials are provided to help investigators prepare their data packages. While reviewing the examples below, look for the symbols that identify which core (✅) and additional (✔️) components each data package includes.

Animal Behavioral, Imaging, and Observational Data

Animal Behavioral Data

Data Package example:

Study title: Temporal Patterns of Spinal Cord Stimulation
Data Package Link (SPARC)

Why this is a good example:

This data package has a clear folder structure and file naming convention that make it easy to find the data files and understand their relationships. The data package follows the repository-specific (SPARC) requirements and documents the study protocol using Protocols.io.

Components in this Data Package:

✅ Data files
✅ README or Summary file
✅ Variable-level Metadata documentation
✅ Repository-specific documentation
✔️ Code used to conduct analyses
✔️ Study Protocol

Animal EEGs

Data Package example:

Study title: Sleep, opiate withdrawal and the N/OFQ - NOP system
Data Package Link (National Sleep Research Resource)

Why this is a good example:

This dataset contains EEG and EMG recordings from non-human primates. In addition to all the core components, this data package includes the study methodology and links to the software used in the analysis.

Components in this Data Package:

✅ Data files
✅ README or Summary file
✅ Variable-level Metadata documentation
✅ Repository-specific documentation
✔️ Code used to conduct analyses
✔️ Publication Citations
✔️ Context or explanatory documents

EEG-Related Resources:

Annotation, storage, sharing, and publication of EEG data (YouTube Link) highlights the importance of labeling EEG data with metadata and of annotating, storing, and publishing those data.
The Brain Imaging Data Structure (BIDS) is currently developing BEP 032: Microelectrode electrophysiology as a community-developed standard to provide clear structure and documentation practices for organizing, describing, and sharing animal electrophysiology data that promote transparency, interoperability, and long-term reuse.

Animal Audiovisual Data

Best Practices:

For non-human subjects data, such as video recordings of mouse behavior, data repository options may be limited by large file sizes. Audiovisual recordings involving animals may raise ethical, regulatory, or programmatic considerations that extend beyond standard identifiability concerns. Research teams should carefully assess these factors when determining what to share and how to document the materials. Researchers may choose to share processed data from audiovisual recordings in addition to or rather than the raw recordings.

Sharing data behind access controls is an appropriate safeguard to manage sensitive data concerns.

Biomedical Imaging Data

Human EEGs

Best Practices:

EEG data sharing should follow community standards, such as the Brain Imaging Data Structure (BIDS), which aligns files and metadata with FAIR principles to support interoperability and reproducibility. Investigators should convert proprietary formats into open, widely supported ones, such as the European Data Format (EDF), and ensure all accompanying metadata are complete and machine-readable. Metadata files (in .json and .tsv formats) should describe hardware details, such as sampling rate and electrode placement, along with task design, event markers, and de-identified participant information.
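As a minimal sketch of the metadata files described above, the following Python writes a BIDS-style EEG sidecar (.json) and a channels table (.tsv) using only the standard library. The sidecar keys and channel columns are intended to follow the BIDS Electroencephalography Extension, but the function name, file prefix, and every value shown are hypothetical placeholders. The output should be checked with the BIDS Validator rather than assumed to be compliant.

```python
import csv
import json
from pathlib import Path


def write_eeg_sidecars(eeg_dir, prefix, sidecar, channels):
    """Write <prefix>_eeg.json and <prefix>_channels.tsv into eeg_dir."""
    eeg_dir = Path(eeg_dir)
    eeg_dir.mkdir(parents=True, exist_ok=True)

    # Recording-level metadata (hardware and task details) goes in the JSON sidecar.
    json_path = eeg_dir / f"{prefix}_eeg.json"
    json_path.write_text(json.dumps(sidecar, indent=2) + "\n", encoding="utf-8")

    # One row per channel goes in a tab-separated table.
    tsv_path = eeg_dir / f"{prefix}_channels.tsv"
    with tsv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=["name", "type", "units"],
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(channels)
    return json_path, tsv_path


# Hypothetical values for illustration only; replace with the study's own.
EXAMPLE_SIDECAR = {
    "TaskName": "rest",
    "SamplingFrequency": 500,
    "PowerLineFrequency": 60,
    "EEGReference": "Cz",
    "SoftwareFilters": "n/a",
}
EXAMPLE_CHANNELS = [
    {"name": "Fz", "type": "EEG", "units": "uV"},
    {"name": "Pz", "type": "EEG", "units": "uV"},
]
```

Calling write_eeg_sidecars("sub-01/eeg", "sub-01_task-rest", EXAMPLE_SIDECAR, EXAMPLE_CHANNELS) would create the two files next to the recording they describe. Event markers and participant information would go in separate .tsv files that are not shown here.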

EEG-Related Resources:

The Brain Imaging Data Structure (BIDS) includes an Electroencephalography Extension. This framework provides clear, standardized file and folder organization as well as example datasets.
Annotation, storage, sharing, and publication of EEG data (YouTube Link) highlights the importance of labeling EEG data with metadata and of annotating, storing, and publishing those data.

MRI and fMRI Imaging Data

Best Practices:

Neuroimaging data are typically shared in standardized formats such as DICOM (.dcm) and NIfTI (.nii or .nii.gz). DICOM files include structured headers containing image metadata, while NIfTI files store image data with an optional .json metadata file, often generated during conversion (e.g., using dcm2niix). For sharing and curation, investigators should de-identify data, removing facial features and any “burned-in” text containing protected health information (PHI). When possible, organize files according to the Brain Imaging Data Structure (BIDS) to ensure consistency and interoperability. Common tools for viewing or validating files include ImageJ, MRIcron, and AFNI.

Body MRI data cover multiple anatomical regions, including abdominal, cardiovascular, and musculoskeletal areas, and should be carefully documented to capture acquisition parameters, such as body region, coil type, contrast timing, and patient positioning. DICOM is the standard format for preserving image metadata, but when converting to shareable formats like NIfTI, investigators should retain all relevant sequence parameter and anatomical coverage information. While community standards for body MRI data organization are still developing, investigators should apply consistent folder structures and include sidecar metadata files (e.g., .json and .tsv) describing imaging series, de-identification methods, and protocol details.
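One way to check that converted images keep their metadata is to confirm that every NIfTI file has a sidecar next to it. The sketch below assumes a simple convention in which each .nii or .nii.gz file is paired with a .json file of the same base name in the same folder; the function names are hypothetical. BIDS datasets may also place shared metadata in higher-level files, which this check does not account for.

```python
from pathlib import Path

NIFTI_SUFFIXES = (".nii.gz", ".nii")


def sidecar_for(image_path):
    """Return the expected .json sidecar path for a .nii or .nii.gz file."""
    image_path = Path(image_path)
    name = image_path.name
    for suffix in NIFTI_SUFFIXES:
        if name.endswith(suffix):
            return image_path.with_name(name[: -len(suffix)] + ".json")
    raise ValueError(f"not a NIfTI file: {image_path}")


def find_missing_sidecars(root):
    """List NIfTI files under root that have no same-named .json sidecar."""
    root = Path(root)
    images = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith(NIFTI_SUFFIXES)
    )
    return [p for p in images if not sidecar_for(p).exists()]
```

An empty list from find_missing_sidecars means every image has a sidecar under this convention. It says nothing about whether the sidecar content is complete.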

MRI, fMRI, and body MRI Resources:

Neuroimaging DICOM and NIfTI Primer (Data Curation Network) is a comprehensive primer on DICOM and NIfTI neuroimaging data, produced jointly by the Data Curation Network, the National Institute of Mental Health, and the National Institute of Neurological Disorders and Stroke.
The COBIDAS Report (PDF), developed by the Organization for Human Brain Mapping (OHBM), outlines best practices for fMRI data analysis and sharing. While not a required standard, these community recommendations are widely recognized for promoting neuroimaging research transparency, reliability, and reproducibility.
For submission to OpenNeuro, follow the guidance and submission steps, and use OpenNeuro’s BIDS Validator to check that the dataset meets community standards.
For body MRI data, the American College of Radiology’s best practices for data sharing emphasize protecting patient anonymity, curating clinically relevant information to support reuse, and ensuring transparency, compliance, and responsible body MRI data sharing.

Microscopy (Cellular Imaging)

Best Practices:

Preparing microscopy data for long-term use and reproducibility involves converting proprietary image files into open formats such as OME-TIFF or the cloud-optimized OME-Zarr, which keep complex image data accessible and reusable. Each data package should include detailed metadata—both the technical acquisition information captured by the microscope and the broader experimental context summarized in a README file. Using clear, descriptive file names and a consistent folder structure to separate raw, standardized, and processed data helps others (and the original study team, in the future) navigate the package easily.
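The folder separation described above can be spot-checked with a short script. This sketch assumes three hypothetical folder names (raw, standardized, processed) and treats the common OME-TIFF and OME-Zarr name endings as the open formats. Both the folder names and the suffix list are illustrative assumptions, not HEAL or repository requirements.

```python
from pathlib import Path

# Hypothetical stage folders; adapt to the study's own layout.
STAGES = ("raw", "standardized", "processed")
# Name endings commonly used for OME-TIFF files and OME-Zarr directories.
OPEN_SUFFIXES = (".ome.tif", ".ome.tiff", ".ome.zarr")


def summarize_package(root):
    """Count entries per stage folder and flag non-OME items in 'standardized'."""
    root = Path(root)
    counts = {}
    flagged = []
    for stage in STAGES:
        folder = root / stage
        entries = sorted(folder.iterdir()) if folder.is_dir() else []
        counts[stage] = len(entries)
        if stage == "standardized":
            flagged = [
                p.name for p in entries
                if not p.name.lower().endswith(OPEN_SUFFIXES)
            ]
    return counts, flagged
```

A non-empty flagged list suggests that a proprietary file was left in the standardized folder or that a conversion step was skipped.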

Microscopy Resources:

Open Microscopy Environment includes free, open tools and metadata specifications to support microscopy data management.
The Brain Imaging Data Structure (BIDS) now includes a BIDS Formatting Microscopy Extension. This framework provides a clear, standardized way to organize files and folders as well as example datasets.
IDR Repository Templates for Imaging Metadata offers useful examples and templates for sharing microscopy data through a repository. Note that the Image Data Resource (IDR) itself is not a HEAL-compliant repository.

Ultrasound

Best Practices:

Ultrasound data are often organized and shared in the Digital Imaging and Communications in Medicine (DICOM) format, which preserves standardized image data and acquisition details. For research applications, DICOM files can be converted to NIfTI to support analysis and visualization. Investigators should keep both the original manufacturer files and standardized exports. Metadata should include details about the imaging device, acquisition settings, and imaging mode, but all Protected Health Information (PHI) must be removed or de-identified. Organizing files in a BIDS-like structure with clear, descriptive names supports readability and automated processing. Any details not captured in the DICOM header should be added in accompanying .json or .tsv files. Documenting software, processing steps, and data provenance helps maintain transparency and reproducibility.

Ultrasound Resource:

The Brain Imaging Data Structure (BIDS) is currently developing a BIDS Extension Proposal for Ultrasound. This framework aims to provide a clear, standardized way to organize files and folders.

Clinical and Health Data

Clinical Trials Data

Best Practices:

NIH-funded clinical trials must be registered on ClinicalTrials.gov, and the assigned NCT number should be included in all data packages. For human subjects studies focused on pain, investigators must incorporate the HEAL Pain Common Data Elements (CDEs) to ensure data collection consistency, enable cross-study comparisons, and support integration with other HEAL-funded research. To promote consistency, interoperability, and reuse, clinical trial data may be organized using standardized formats such as Clinical Data Interchange Standards Consortium (CDISC) standards, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), or another standard that fits the data type. Each data package should include de-identified participant-level data, a detailed data dictionary, study protocols, analysis code, and informed consent materials that describe data use and sharing. Maintaining traceability from raw data to analyzed results ensures transparency and reproducibility and keeps the package aligned with NIH HEAL data sharing requirements.
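Two of the checks implied above are easy to automate before submission: confirming that the ClinicalTrials.gov registration identifier is well formed (the letters NCT followed by eight digits), and confirming that the data file and the data dictionary describe the same variables. The sketch below assumes a CSV data file and a CSV data dictionary with a column named variable_name. That column name and the function names are hypothetical and should be adapted to the study's own files.

```python
import csv
import re
from pathlib import Path

NCT_PATTERN = re.compile(r"NCT\d{8}")


def is_valid_nct(identifier):
    """Return True if the string looks like a ClinicalTrials.gov identifier."""
    return NCT_PATTERN.fullmatch(identifier) is not None


def compare_with_dictionary(data_csv, dictionary_csv, name_column="variable_name"):
    """Return (undocumented, unused) variable names.

    undocumented: columns in the data file that the dictionary does not describe.
    unused: dictionary entries that do not appear in the data file.
    """
    with Path(data_csv).open(newline="", encoding="utf-8") as handle:
        data_columns = next(csv.reader(handle))  # header row only
    with Path(dictionary_csv).open(newline="", encoding="utf-8") as handle:
        documented = [row[name_column] for row in csv.DictReader(handle)]
    undocumented = [c for c in data_columns if c not in documented]
    unused = [d for d in documented if d not in data_columns]
    return undocumented, unused
```

Both returned lists should be empty for a finished package. The format check only confirms the shape of the identifier, not that the trial is actually registered.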

Resources:

Clinical Trials Data Curation Primer (Data Curation Network) guides researchers through key steps, offering best practices and tools to ensure trial data are well organized, shareable, and reusable.
Rethinking Clinical Trials: A Living Textbook is an expert-curated resource for pragmatic clinical trial design, conduct, and reporting, offering best practices, case studies, and tools to help researchers improve trial execution and data sharing.

Clinical Trials Questionnaires / Survey Instruments

Data Package example:

Study title: CTN-0093: Validation of a Community Pharmacy-based Prescription Drug Monitoring Program Risk Screening Tool (PHARMSCREEN)
Data Package Link (NIDA Data Share)

Why this is a good example:

This dataset is openly accessible (although the repository requires a name and email address for access) and includes thorough documentation describing data preparation and de-identification. It also provides a data dictionary, null value file, and study-level metadata with the protocol. Together, these elements make it a clear, well-structured example of a tabular data package that aligns with HEAL data sharing expectations.

Components in the Data Package:

✅ Data file(s)
✅ README or Summary file
✅ Variable-level Metadata documentation
✅ Repository-specific documentation
✔️ Code used to transform raw data to analytic dataset
✔️ Code used to conduct analyses
✔️ Publication Citation(s)
✔️ Study Protocol
✔️ Blank data collection instruments
✔️ Context or explanatory documents

Biomedical Laboratory Data

Data Package example:

Study Title: Developing equilibrative nucleoside transporter inhibitors as non-opioid pain therapeutics
Data Package Link (Mendeley)

Why this is a good example:

This dataset contains a well-structured experimental data package supporting transparency and reuse across multiple research domains. It combines diverse data types (biochemical assays, imaging, electrophysiology, and behavioral studies) with detailed documentation that links methods, results, and analysis workflows. By providing both raw and processed data under an open data license, it demonstrates best practices for sharing complex, multimodal research in alignment with HEAL and FAIR principles.

Components in the Data Package:

✅ Data file(s)
✅ README or Summary file
✅ Variable-level Metadata documentation
✅ Repository-specific documentation
✔️ Context or explanatory documents

Lab Data-Related Resources:

A Practical Guide to data management and sharing for biomedical laboratory researchers (Experimental Neurology) includes sections on data standards, documentation, metadata, data dictionaries, protocols, code, and workflows for lab-created data.

Omics Data

Genomics / Sequencing

Data Package example:

Study Title: Human Nociceptor and Spinal Cord Molecular Signature Center
Data Package Link (SPARC)

Why this is a good example:

This SPARC dataset provides an example of how long-read sequencing data can be packaged for transparency and reuse. The data include both raw and processed files, detailed documentation of experimental methods and sequencing workflows, and clear metadata describing instruments, file types, and analysis tools.

Components of the Data Package:

✅ Data file(s)
✅ README or Summary file
✅ Variable-level Metadata documentation
✅ Repository-specific documentation
✔️ Study Protocol
✔️ Context or explanatory documents

Data Package example:

Study Title: Targeting sensory ganglia and glial signaling for the treatment of acute and chronic pain
Data Package Link (GEO)
Data Package Link (SRA)

Why this is a good example:

These connected data packages demonstrate short-read sequencing data sharing through linked GEO and SRA records. The GEO submission includes both raw and processed files with clear metadata describing experimental design and analysis methods, while the SRA record provides access to the underlying sequencing reads in an open, standardized format. Together, they illustrate how coordinated repository submissions can support transparency, reproducibility, and long-term reuse.

Components of the Data Package:

✅ Data file(s)
✅ README or Summary file
✅ Variable-level Metadata documentation
✅ Repository-specific documentation
✔️ Publication Citation(s)

Genomics/Sequencing-Related Resources:

NIH Genomic Data Submission and Release Expectations outlines NIH expectations for timely submission, data access, and genomic dataset releases to ensure compliance with the Genomic Data Sharing Policy.
HEAL Stewards Guidance provides information on Genomic Data as a Sensitive Data Type.
GEO Submission Guidance provides step-by-step instructions for submitting functional genomics data to the Gene Expression Omnibus (GEO) repository, including file preparation, metadata, and repository-specific requirements.
GEO Templates offers downloadable spreadsheet templates to help researchers organize and format GEO submissions in a consistent, machine-readable structure.
SRA Submission Guidance explains how to prepare, validate, and submit sequencing data to the Sequence Read Archive (SRA), covering accepted file types, metadata, and submission tools.

Proteomic Data

Best Practices:

Share raw and processed data files in a consistent folder structure, using open formats such as mzML or mzIdentML, and accompany them with complete metadata describing instruments, software, and analytical methods. Include version information, a descriptive README, and identifiers that link related files.
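A file manifest is one simple way to provide identifiers that link related files. The sketch below assumes that raw, mzML, and mzIdentML files from the same run share a file name stem (for example, run01.raw, run01.mzML, run01.mzid) and uses that stem as a hypothetical run identifier alongside a SHA-256 checksum. The naming convention and function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return one row per file, linked by a shared run identifier (the file stem)."""
    root = Path(root)
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rows.append({
            "run_id": path.name.split(".")[0],  # assumes stems identify runs
            "relative_path": path.relative_to(root).as_posix(),
            "sha256": sha256_of(path),
        })
    return rows
```

The rows can be written to a .tsv file and shipped with the README so that reusers can verify downloads and match each processed file to its raw source.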

Proteomic Resources:

The Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) develops and maintains community-driven standards, file formats, and controlled vocabularies to support interoperable, reusable proteomics and mass-spectrometry datasets.
MassIVE, a HEAL-compliant repository, offers a dedicated platform to archive, browse, and re-analyze mass-spectrometry proteomics data, supporting community reuse and transparency through structured submission workflows.

Qualitative and Social Science Data

Audiovisual Data

Best Practices:

Each package should include metadata describing recording conditions, equipment, and any processing or editing steps, along with documentation noting consent status and collection context. When sharing audiovisual data, researchers should ensure consent aligns with NIH and HEAL data sharing expectations and should remove or obscure identifiable information, such as names, faces, or voices, to protect participant privacy.

Sharing data behind access controls is an appropriate safeguard to manage sensitive data concerns. Researchers may choose to share processed data derived from audiovisual recordings in addition to or rather than the raw recording files. For example, processed data may include redacted transcripts, coding schemas, or observational battery data.

Audiovisual Resource:

Audiovisual Data Curation Primer (Data Curation Network)

Interviews, Focus Groups, and Case Studies

Best Practices:

A well-prepared qualitative data package should include de-identified transcripts or text files, documentation describing the study design and analytic approach, and supporting materials, such as codebooks and interview guides, that provide essential reuse context. These components help ensure qualitative data are clearly documented, ethically shared, and useful for future research. When investigators worry that transforming the data files does not sufficiently de-identify them, access controls offer an additional layer of protection against deductive disclosure risk.
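As a rough illustration of one transformation step, the sketch below replaces a study-supplied list of names and places in a transcript with bracketed placeholders. The function name, placeholder labels, and example terms are hypothetical. A lookup of this kind only catches terms the team has already listed, so it is a starting point for manual review and does not by itself de-identify a transcript.

```python
import re


def redact(text, replacements):
    """Replace each listed term (whole words, case-insensitive) with its placeholder."""
    if not replacements:
        return text
    # Try longer terms first so a full name is matched before a first name alone.
    terms = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): placeholder for term, placeholder in replacements.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)
```

For example, redact("I spoke with Mary Smith.", {"Mary Smith": "[PARTICIPANT_01]"}) returns "I spoke with [PARTICIPANT_01]." Indirect identifiers, such as a rare occupation or a distinctive event, still require human judgment.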

Qualitative Data Resources:

Qualitative Data Curation Primer (Data Curation Network) provides practical guidance to prepare, document, and share qualitative research data in alignment with NIH and HEAL expectations for responsible, transparent, and reusable data.
Managing Qualitative Data (Qualitative Data Repository) provides guidance to prepare and share qualitative research emphasizing ethical practices, clear documentation, and long-term data reuse.
Guide for Sharing Qualitative Data at ICPSR outlines best practices to prepare and document qualitative data, so it can be shared responsibly, understood by others, and reused in ways that align with HEAL’s transparency and data stewardship goals.
HEAL Public Access and Data Sharing policy describes HEAL’s expectations for protecting the privacy of human research participants.

Questionnaires

Data Package example:

Study Title: Examining the synergistic effects of cannabis and prescription opioid policies on chronic pain, opioid prescribing, and opioid overdose
Data Package Link (NAHDAP)

Why this is a good example:

This dataset provides a well-documented example of questionnaire data, including multiple rounds of expert responses, clearly defined variables, and comprehensive supporting documentation. It illustrates how structured data can be shared in a trusted and access-controlled repository to promote discoverability, transparency, and reuse.

Components in the Data Package:

✅ Data file(s)
✅ README or Summary file
✅ Variable-level Metadata documentation
✅ Repository-specific documentation
✔️ Blank data collection instruments
✔️ Context or explanatory documents

Social Media Analysis

Data Package example:

Study Title: Tracking the opioid epidemic with social media: an early warning system
Data Package Link (GitHub)

Why this is a good example:

This data package includes clear metadata, data dictionaries, and organized folders for raw data, processed outputs, and analysis code, making the workflow transparent and reproducible. By providing open access under an MIT license and including both documentation and code, it aligns with HEAL accessibility, transparency, and responsible data sharing principles.

Components in the Data Package:

✅ Data file(s)
✅ README or Summary file
✅ Variable-level Metadata documentation
✅ Repository-specific documentation
✔️ Code used to transform raw data to analytic dataset
✔️ Code used to conduct analyses
✔️ Publication Citation(s)
✔️ Study Protocol
✔️ Context or explanatory documents
✔️ Clear terms of reuse

Social Media-Related Resource:

Twitter Data Curation Primer (Data Curation Network) offers practical guidance to organize and share social media datasets in a transparent, reproducible way.