This resource outlines key considerations affecting data sharing at each stage of the study lifecycle, including what to consider and why, specific actions that pave the way for successful data sharing, and links to supporting resources and tools for further guidance.
Select a study lifecycle stage from the left-hand tabs to learn more about key topics in that stage, why they matter, and what actions you can take.
Additional resources:
NIH requires researchers to submit a Data Management and Sharing Plan (DMSP) describing how study data will be managed and shared throughout the study lifecycle and beyond. Special considerations may apply for managing and sharing sensitive data, such as identifiable human subjects data. The HEAL Initiative also has specific data sharing expectations.
Lessons learned: Each study is unique. Example DMSPs, templates, and boilerplate language are useful references, but researchers must take their study’s specific requirements, constraints, and resources into account when writing their DMSP. Failure to do so can result in DMSPs that are poorly aligned with HEAL data sharing expectations, difficult to implement, or overly permissive or restrictive for the data being shared.
What to do:
Additional resources:
Making data accessible and reusable may incur costs. Factors directly influencing costs include data repository choice, curation or de-identification support, data management technology solutions, and staff effort. Many other factors can indirectly influence costs, such as organizational data management and sharing resources or the maturity of data management practices.
Lessons learned: Failure to budget appropriately for data management and sharing can result in difficulty securing resources needed to fulfill data sharing commitments. This can increase risk, decrease compliance, and negatively impact the reusability of shared data. For example, if professional de-identification services are needed but not budgeted for, studies may respond by not sharing data at all (decreased compliance) or having untrained staff attempt de-identification (increased risk). Researchers less experienced with sharing data or those using an unfamiliar repository may underestimate the effort needed to prepare data for sharing, and thus fail to budget for adequate staff time or curation support services.
What to do:
Additional resources:
Organizations often have various technical and human resources supporting data management and sharing. These may be found in library data services (e.g., data curators), ethics boards (e.g., IRBs), privacy offices (e.g., honest brokers and privacy law experts), IT units (e.g., data management software, data security experts), data or analysis cores, and elsewhere across the organization.
Lessons learned: Organizational resources can impact both the DMSP and the budget. For example, some organizations offer free data curation resources. In those that don’t, studies requiring curation support should budget for curation fees and consider naming a data repository that offers curation services in their DMSP.
What to do:
Different actors and organizations have varying data sharing goals and ideas of how best to accomplish them. Identifying goals and roles early is especially important in collaborative research where multiple individual and organizational stakeholders must reach a consensus.
Lessons learned: If data sharing goals and roles aren’t agreed upon at a project’s beginning, obstacles can arise when it’s time to share data. There may be disagreements among collaborators about what data can or should be shared, and how or where to share it. Collaborators may mention local regulations or organizational policies circumscribing data sharing that other collaborators weren’t previously aware of. The collaborator(s) responsible for data management may not have planned ahead for sharing, necessitating extra effort to curate data and incurring costs the study didn’t budget for. Unanticipated obstacles can arise even when researchers share data within a collaboration, such as when a researcher at one organization transfers data to a collaborator at another organization.
What to do:
HEAL studies often involve sensitive data (e.g., identifiable human subjects data and proprietary intellectual property) subject to legal, contractual, and/or regulatory requirements. Understanding your organization’s compliance requirements is essential when setting data sharing goals, selecting resources, and designing data workflows.
Lessons learned: Responsible data sharing doesn’t always mean open access. Sensitive HEAL data can be shared safely using techniques like de-identification, anonymization, date shifting, dimensionality reduction, introducing statistical noise, and controlled data access. Sharing data too permissively carries risk, but failing to maximize appropriate data sharing may violate HEAL Initiative compliance expectations.
What to do:
Additional resources:
Studies can support ethical data practices by using frameworks that balance risks and benefits for all stakeholders, including researchers, study subjects, funders, and affected communities. This is especially important in HEAL studies involving groups such as patients with substance use disorder or American Indian and Alaska Native (AI/AN) communities.
Lessons learned: Ethical data practices support scientific validity, build trust, and support long-term collaboration. Overlooking data ethics can lead to data quality issues, strained relationships, and lost opportunities for future collaborations.
What to do:
Research data is often subject to various contracts. Some contracts focus on data directly, like Data Transfer Agreements (DTAs), Data Use Agreements (DUAs), Data Sharing Agreements (DSAs), data licenses, or Terms of Use in data repositories. Other contracts that govern the research project more generally, such as Research Agreements or Confidentiality Agreements, may also have implications for data sharing.
Lessons learned: How contracts impact data sharing may not always be obvious, especially when contracts are signed early in the study lifecycle, long before data are actually shared. Using boilerplate language or templates uncritically, or signing contracts without fully understanding their implications, can result in contracts that do not accurately reflect the project’s data sharing requirements, goals, and roles. It’s important to understand what contracts mean for data rights and data sharing before signing them to prevent unanticipated obstacles later in the study.
What to do:
Research data is often handled through numerous hardware and software components. Each component may have its own data privacy, security, compliance, usability, and accessibility considerations. Moreover, these components may be spread across multiple institutions with differing policies governing data and technology.
Lessons learned: Technology choices must balance two critical factors: the study’s requirements and organizational requirements related to privacy, security, and compliance. IT departments often recommend pre-vetted technologies meeting organizational requirements, but these options don’t work for every study. Technologies that meet all study requirements (e.g., storing very large data, collecting data electronically offline, or enabling access for collaborators at other organizations) may not have been vetted by the organization, leading to privacy or security risks, or consequences for not complying with organizational policy. Working with institutional IT departments to choose technology produces the best outcomes; however, it’s important that study teams clearly communicate their requirements so IT colleagues can make appropriate recommendations.
What to do:
Additional resource:
Consistent data organization enables efficiency and collaboration for the study team and any future data re-users.
Lessons learned: Inconsistent file organization can lead to outdated or incorrect file use, hinder collaboration, and reduce reproducibility. Clear, consistent file and folder naming conventions, folder structures (e.g., organized by grant aim or data type), versioning, co-located metadata, and documentation explaining how the data was collected, processed, and structured make data easier to manage and use. Some data repositories require specific deposition structures; adopting these early will speed data sharing later.
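As a sketch of what a consistent naming convention and stage-based folder layout might look like, here is a minimal Python example. The `project_datatype_vNN_YYYY-MM-DD` pattern and the folder names are illustrative assumptions, not a HEAL requirement; adapt them to your team's conventions or your target repository's required deposition structure.

```python
from pathlib import Path

def standard_name(project: str, datatype: str, version: int,
                  day: str, ext: str = "csv") -> str:
    """Build a file name like 'heal_survey_v02_2024-05-01.csv'.

    Zero-padded version numbers keep files sorted in version order;
    ISO dates (YYYY-MM-DD) sort chronologically as plain text.
    """
    return f"{project}_{datatype}_v{version:02d}_{day}.{ext}"

# One common layout, organized by processing stage (illustrative):
layout = [
    Path("data/raw"),        # untouched source data, never edited in place
    Path("data/processed"),  # cleaned / derived data
    Path("docs"),            # README, data dictionary, protocols
    Path("code"),            # processing and analysis scripts
]

print(standard_name("heal", "survey", 2, "2024-05-01"))
# heal_survey_v02_2024-05-01.csv
```

Keeping raw and processed data in separate folders, with scripts and documentation alongside, makes it clear which files are inputs and which are outputs when it is time to deposit.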
What to do:
Data standards, including Common Data Elements (CDEs), improve interoperability and enable integration with similar datasets. The NIH HEAL Initiative requires pain clinical studies to use HEAL core CDEs and encourages other studies to use HEAL CDEs or broader NIH CDEs when possible.
Lessons Learned: Using required CDEs from the start is important. Some HEAL repositories mandate conformance to specific data standards as a deposition requirement (e.g., NIMH Data Archive), and HEAL requires pain clinical studies to use certain core CDEs for human subjects data. Failing to deploy required CDEs when collecting data creates additional data preparation work before sharing and may lead to conversion difficulties resulting in data loss. Even small changes to CDEs can result in non-compliance, so studies should deploy data standards exactly as specified.
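Deploying a CDE "exactly as specified" includes its permissible values. A simple automated check during data collection can catch deviations early; in this sketch the element name and value set are hypothetical, not an actual HEAL CDE.

```python
# Hypothetical CDE specification: element name and permissible values
# are illustrative only, not taken from the HEAL CDE repository.
cde_spec = {
    "element": "PAIN_INTERFERENCE_SCALE",
    "permissible_values": {"0", "1", "2", "3", "4"},
}

# Values as collected; "5" is out of range and "" is an unlabeled missing value.
collected = ["0", "2", "5", "3", ""]

# Flag anything outside the permissible value set before data preparation.
violations = [v for v in collected if v not in cde_spec["permissible_values"]]
print(violations)  # ['5', '']
```

Running a check like this at collection time, rather than at deposition, avoids lossy conversions later.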
What to do:
Additional resources:
Data very often undergoes processing between collection and analysis: cleaning, merging, de-identifying, annotating, or otherwise transforming it. The output of these processes may be referred to as “clean,” “derived,” or “analytic” data (vocabulary differs across disciplines). Processed, or derived, data should be reproducible and traceable back to its raw inputs through code or other documentation.
Lessons Learned: Failure to adequately document and describe data processing hampers reproducibility. Studies that rely heavily on manual processes—like point-and-click operations in Excel—create data outputs that are difficult to reproduce without extensive documentation. This limits transparency, increases the risk of human error, and makes both data quality control and reproducibility challenging. By contrast, code-based processing creates a repeatable chain of transformations, enhancing both credibility and reproducibility.
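To illustrate the contrast with manual spreadsheet edits, here is a minimal scripted cleaning step using only the Python standard library. The column names and cleaning rules are illustrative; the point is that every transformation is recorded in code and can be rerun on the raw input.

```python
import csv
import io

# Raw input with the kinds of inconsistencies manual edits often hide:
# stray whitespace, a missing value, and mixed-case categorical codes.
RAW = """id,age,state
001, 34 ,ny
002,,CA
003, 41 ,Ca
"""

def clean_rows(reader):
    """Scripted cleaning: trim whitespace, normalize case, code missing values.

    Because each step is code, the raw -> clean transformation is
    repeatable and auditable, unlike point-and-click spreadsheet edits.
    """
    for row in reader:
        yield {
            "id": row["id"].strip(),
            "age": row["age"].strip() or "NA",      # explicit missing-value code
            "state": row["state"].strip().upper(),  # normalize categorical values
        }

cleaned = list(clean_rows(csv.DictReader(io.StringIO(RAW))))
print(cleaned[1])  # {'id': '002', 'age': 'NA', 'state': 'CA'}
```

Rerunning this script on the raw file always yields the same clean file, which is exactly the repeatable chain of transformations the lesson above describes.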
What to do:
Sharing data becomes more complex when all or part of the dataset was acquired from external sources. Researchers may assume that once data is in their possession, it can be freely shared—but acquired data typically comes with restrictions. These limitations are often specified in Data Use Agreements (DUAs), license terms, repository conditions, or other contracts that govern the data's original release. Studies that incorporate previously collected data must factor in these limitations when sharing data.
Lessons Learned: If secondary and primary data are combined without carefully addressing these constraints, it can lead to noncompliance with legal or ethical requirements and may prevent the study from sharing its data publicly. Studies must carefully assess which portions of acquired data can be shared (if any), transform data files to omit or redact data points that cannot be shared, and reflect any restrictions in their Data Management and Sharing Plan (DMSP).
What to do:
Additional resource:
The software used to collect and analyze data may affect how accessible and reusable data are. Using proprietary software introduces dependencies on specific platforms, licenses, or file formats that may not be accessible to all users.
Lessons Learned: Data, code, or documentation saved in proprietary formats can become inaccessible to researchers who lack the necessary tools. This creates financial, geographic, or institutional barriers to reuse, limiting the reach of shared data and producing unequal opportunities to use data. In contrast, open-source tools and formats promote transparency, reduce barriers to access, and support broader reuse. Choosing open-source options whenever feasible helps ensure that research outputs remain usable and accessible over time.
What to do:
Additional resource:
A single data asset may be used in multiple ways (e.g., in journal articles, posters, or conference presentations) and often undergoes several transformations. Tracking data provenance (origin, version history, and chain of transformations) ensures data transformations and outputs are traceable and verifiable.
Lessons learned: Without clear provenance it becomes difficult to determine which data version supports a specific finding. This limits reproducibility and may lead to misinterpretation or unnecessary raw data re-processing. Maintaining strong provenance builds trust, ensures transparency, and protects against misuse.
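One lightweight way to maintain provenance is to write a small machine-readable record for each transformation step, linking inputs to outputs by content hash. This is a sketch under assumed conventions (the field names, file names, and truncated 12-character digests are illustrative choices, not a standard).

```python
import hashlib
import json

def provenance_record(step: str, inputs: dict, outputs: dict, script: str) -> dict:
    """Record one transformation step: what went in, what came out,
    and which script produced it. Content hashes let a later reader
    verify that files on disk match the recorded versions."""
    def digest(content: bytes) -> str:
        # Truncated SHA-256 digest; an illustrative choice for brevity.
        return hashlib.sha256(content).hexdigest()[:12]
    return {
        "step": step,
        "script": script,
        "inputs": {name: digest(data) for name, data in inputs.items()},
        "outputs": {name: digest(data) for name, data in outputs.items()},
    }

# Hypothetical file names and contents, for illustration only.
rec = provenance_record(
    step="de-identify",
    inputs={"raw_survey.csv": b"id,name\n1,Alice\n"},
    outputs={"survey_deid.csv": b"id\n1\n"},
    script="deidentify.py",
)
print(json.dumps(rec, indent=2))
```

Accumulating these records over a study's lifetime answers the question "which data version supports this finding?" without re-processing raw data.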
What to do:
Additional resource:
Data cleaning enhances data quality and integrity, ensuring findings are valid and datasets are suitable for reuse.
Lessons learned: Inconsistent or incomplete cleaning can compromise results, inhibit reproducibility, and distort downstream analyses and insights.
What to do:
Additional resources:
Protecting sensitive information is essential for ethical research and compliance with regulations like HIPAA. De-identification and anonymization transform or eliminate direct (e.g., names, email addresses) and indirect (e.g., geographic information) identifiers, but anonymization may require more specialized skills, such as statistical disclosure control.
Lessons learned: Unclear, inconsistent, or late de-identification or anonymization can lead to privacy breaches, legal and ethical noncompliance, rework, or unsharable data. On the other hand, excessive anonymization may strip away important context, reducing data’s reusability.
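As a concrete sketch of two techniques named above, the following removes direct identifiers and applies per-subject date shifting. The field names, offset range, and record layout are illustrative assumptions; real de-identification should follow your IRB's and repository's requirements and, where needed, involve trained specialists.

```python
import random
from datetime import date, timedelta

def deidentify(record: dict, shift_days: int) -> dict:
    """Drop direct identifiers and shift dates by a per-subject offset.

    One random offset per subject preserves the intervals between that
    subject's own dates while obscuring the true calendar dates."""
    out = {k: v for k, v in record.items() if k not in {"name", "email"}}
    out["visit_date"] = out["visit_date"] + timedelta(days=shift_days)
    return out

rng = random.Random(42)        # seeded here only to make the sketch reproducible
offset = rng.randint(-30, 30)  # one offset for this subject (range is illustrative)

visits = [
    {"name": "A. Doe", "email": "a@x.org", "subject_id": "S01",
     "visit_date": date(2023, 3, 1)},
    {"name": "A. Doe", "email": "a@x.org", "subject_id": "S01",
     "visit_date": date(2023, 3, 15)},
]
deid = [deidentify(v, offset) for v in visits]

# Interval between visits is preserved (14 days) even though dates moved.
print((deid[1]["visit_date"] - deid[0]["visit_date"]).days)  # 14
```

Note the trade-off the lesson describes: shifting dates preserves analytic value (intervals) while reducing identifiability, whereas deleting dates outright would strip that context.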
What to do:
Additional resources:
Metadata is essential for making research data findable, interpretable, and reusable. In the HEAL Data Ecosystem, study-level and variable-level metadata (SLMD and VLMD) support HEAL Data Platform findability and enable HEAL Semantic Search interoperability. Tabular data VLMD is often stored in a data dictionary or codebook, while protocols, standard operating procedures (SOPs), and README files typically contain SLMD. Non-tabular VLMD may be stored in files such as interview scripts or coding schemas (qualitative data), instance files or annotations (imaging), or sequence files (genomics data).
Lessons learned: Insufficient metadata makes data hard to find, interpret, integrate, or reuse, even by the original study, over time. Missing metadata reduces visibility and increases the risk of misinterpretation or misuse.
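A tabular data dictionary can be as simple as one structured entry per variable. The field names below are illustrative assumptions for the sketch; the HEAL VLMD schema defines the exact fields expected on the HEAL Data Platform, so consult it before depositing.

```python
import json

# Minimal variable-level metadata (data dictionary) for two variables.
# Variable names, field names, and constraints are hypothetical examples.
data_dictionary = {
    "pain_score": {
        "description": "Self-reported pain intensity at baseline",
        "type": "integer",
        "constraints": {"minimum": 0, "maximum": 10},
        "missing_values": ["NA"],   # documents the study's missing-value code
    },
    "visit_date": {
        "description": "Date of study visit (shifted for de-identification)",
        "type": "date",
        "format": "YYYY-MM-DD",
    },
}

print(json.dumps(data_dictionary["pain_score"], indent=2))
```

Even this small amount of structure tells a secondary user what each column means, what values are valid, and how missingness is coded, which is precisely what prevents the misinterpretation described above.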
What to do:
Additional resources:
Well-documented workflows and consistent organization ensure transparency, continuity, and reproducibility. Undocumented workflows lead to confusion, errors, and hindered reproducibility.
Lessons learned: Inconsistent file structures, missing metadata, and undocumented processing increase the risk of errors and make it harder to track data provenance. Well-documented workflows enable validation, replication, and continuity, which supports compliance with the NIH DMS Policy’s emphasis on maximizing data sharing “of sufficient quality to validate and replicate the research findings.”
What to do:
Additional resources:
Transparent interpretation of research findings depends on clearly stating the assumptions, decisions, and frameworks that shaped the analysis.
Lessons learned: Unrecorded choices about how to handle missing data, which variables to include, and model selection can lead to data misinterpretation, misuse, or reproducibility concerns. Secondary users risk drawing invalid conclusions without this context.
What to do:
Additional resource:
Use version control to connect datasets, scripts, and outputs to specific analyses and published findings. Clearly labeled, well-structured outputs (e.g., tables, charts, graphs, and other visualizations) support reproducibility and interpretability across publications, presentations, posters, and other mediums. The same data may underlie multiple similar outputs; for example, analytic results may appear on a poster several months before a different version is published in a journal.
Lessons learned: HEAL studies must share both data and research outputs appropriately. Outputs that lack context or links to underlying data and code limit reproducibility. Overwriting code or modifying data without documenting changes also impedes reproducibility and validation, undermining trust in the findings. Because the same data may generate multiple outputs, the study team and secondary users must be able to distinguish which versions of data and code files produced an output. For example, if a dataset is cited in a journal article, the version in the repository should match the one used in the publication.
What to do:
Storing data in a trusted research repository supports reproducibility, access, and reuse. HEAL studies are expected to share data through one or more of the 29 HEAL-compliant data repositories evaluated for hosting HEAL-funded data. Note: the HEAL Data Platform is a catalog that links to HEAL data in compliant repositories; it is not a data repository.
Lessons learned: Repositories provide a range of services to researchers during submission; review protocols and available curation support before submitting data. Use different repositories for different types of study data (e.g. one repository for sequence or imaging data, and another for code/scripts) if needed. Include VLMD with your data. Your organization may have membership benefits with some data repositories.
What to do:
Additional resources:
According to the NNLM, “Data curation is composed of research data management and digital preservation and involves processes such as adding metadata to make data more findable and understandable, ingesting data into a [data] repository, … validating file checksums and file fixity checks, and other tasks for organizing, cleaning, describing, enhancing, storing, and preserving data.” Data curation transforms data used by the study team into forms appropriate for external sharing and long-term preservation.
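The "validating file checksums and file fixity checks" the NNLM mentions can be done with standard tools; here is a minimal Python sketch. The demo file name and contents are illustrative; in practice you would hash each deposited file and record the checksums in a manifest alongside the data.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum for file-fixity checks.

    Recomputing the hash after a transfer, or periodically in storage,
    and comparing it to the recorded value verifies the file is unchanged.
    """
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large data files don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with a small temporary file (name and contents are illustrative).
demo = Path("demo.txt")
demo.write_bytes(b"HEAL dataset, version 1\n")
checksum = sha256_of(demo)
print(checksum)
demo.unlink()  # clean up the demo file
```

Many repositories compute and publish checksums on deposit; computing your own before upload lets you confirm the transfer was lossless.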
Lessons learned: Under the NIH DMS Policy and HEAL Public Access and Data Sharing policy, HEAL studies must appropriately share scientific data underlying research findings, but not all raw data needs to be shared. Curation often involves HEAL studies transforming data (e.g. de-identification) and generating metadata. Starting early reduces the workload later.
What to do:
Additional resources:
Open access (also called public access) allows anyone to freely access and reuse shared data. Controlled and restricted access limit findability, accessibility, and reusability. Available access controls vary across data repositories. Data licenses or contracts can define allowable uses of shared data.
Lessons learned: HEAL studies may use open, controlled, or restricted access approaches. For sensitive data, consider repositories with secure platform protections and controlled access options.
What to do:
Additional resources:
Publishing research data with journal articles supports replication and re-use and is often required by the publisher. It can also boost citation counts and enhance research impact. Persistent identifiers (PIDs) allow research artifacts to be linked and referenced across different locations, promoting findability.
Lessons learned: HEAL policy expects that “Underlying Primary Data for the Publications will be made broadly available through an appropriate data repository.” “Available upon request” statements are not HEAL compliant.
What to do:
Sharing research data fosters collaboration and knowledge-building, cross-disciplinary discoveries, and public health progress. It enhances researcher visibility, supports career advancement, and increases publication and citation opportunities.
Lessons learned: Repositories often track dataset reuse metrics (views, downloads, citations), which may support tenure and other promotion considerations. Studies show that publications linked to repository-hosted data are up to 25% more likely to be cited.
What to do:
Additional resources: