Data Sharing Throughout the Research Lifecycle

Decisions and activities at each stage of the study’s lifecycle impact data sharing. Select a tab to learn more about key topics in a lifecycle stage, why they matter, and what actions you can take.

Overview

This resource outlines key considerations affecting data sharing at each stage of the study lifecycle, including what to consider and why, specific actions that pave the way for successful data sharing, and links to supporting resources and tools for further guidance.

Select a study lifecycle stage from the left-hand tabs to learn more.

Pre-Award
Start Up
Generate/Collect
Process
Analyze
Share

Additional resources:

NIST’s Research Data Framework (RDaF) 2.0

Pre-Award

Writing a NIH data management and sharing plan (DMSP)

The NIH data management and sharing plan (DMSP) requires researchers to plan for managing and sharing study data throughout the study lifecycle and beyond. Special considerations may apply for managing and sharing sensitive data, like identifiable human subjects data. The HEAL Initiative also has specific data sharing expectations.

Lessons learned: Each study is unique. Example DMSPs, templates, and boilerplate language are useful references, but researchers must take their study’s specific requirements, constraints, and resources into account when writing their DMSP. Failure to do so can result in DMSPs that are poorly aligned with HEAL data sharing expectations, difficult to implement, or disproportionately permissive or restrictive for the data being shared.

What to do:

Align the DMSP with HEAL data sharing expectations. The HEAL Data Checklist includes guidance for multiple topics referenced in NIH DMSPs. These include: selecting a data repository, adopting appropriate data standards (like HEAL Common Data Elements (CDEs)), assigning persistent identifiers to shared data assets, and more. Contact the HEAL Data Stewards or the NIH Program Officer with questions.
Seek feedback on DMSP drafts from your Office of Research or Library Data Services. They may be able to identify potential conflicts with organizational data policies or opportunities to utilize organizational resources.

Additional resources:

Budgeting for data management and sharing

Making data accessible and reusable may incur costs. Factors directly influencing costs include: data repository choice, curation or de-identification support, data management technology solutions, and staff effort. Many other factors can indirectly influence costs, such as organizational data management and sharing resources or data management practice maturity.

Lessons learned: Failure to budget appropriately for data management and sharing can result in difficulty securing resources needed to fulfill data sharing commitments. This can increase risk, decrease compliance, and negatively impact the reusability of shared data. For example, if professional de-identification services are needed but not budgeted for, studies may respond by not sharing data at all (decreased compliance) or having untrained staff attempt de-identification (increased risk). Researchers less experienced with sharing data or those using an unfamiliar repository may underestimate the effort needed to prepare data for sharing, and thus fail to budget for adequate staff time or curation support services.

What to do:

Review NIH guidance on allowable costs for data management and sharing and include those applicable in the budget.
Ensure budgeted costs align with activities and resources described in the DMSP, and vice versa.

Additional resources:

Budgeting Resources for Data Management, Sharing, and Curation (HEAL Stewards)
Budgeting for Data Management & Sharing (NIH)

Identifying organizational resources for data management and sharing

Organizations often have various technical and human resources supporting data management and sharing. These may be found in library data services (e.g. data curators), ethics boards (e.g. IRBs), privacy offices (e.g. honest brokers and privacy law experts), IT units (e.g. data management software, data security experts), data or analysis cores, and elsewhere across the organization.

Lessons learned: Organizational resources can impact both the DMSP and the budget. For example, some organizations offer free data curation resources. In those that don’t, studies requiring curation support should budget for curation fees and consider naming a data repository that offers curation services in their DMSP.

What to do:

Consult Offices of Research, Library Data Services, or other centralized data sharing support units to identify organizational resources your study can utilize. Remember to consider resources offered at collaborating organizations; for example, some organizations have data repository memberships with benefits like waived deposition fees.

Start Up

Identifying data sharing goals and roles

Different actors and organizations have varying data sharing goals and ideas of how best to accomplish them. Identifying goals and roles early is especially important in collaborative research where multiple individual and organizational stakeholders must reach a consensus.

Lessons learned: If data sharing goals and roles aren’t agreed upon at a project’s beginning, obstacles can arise when it’s time to share data. There may be disagreements among collaborators about what data can or should be shared, and how or where to share it. Collaborators may mention local regulations or organizational policies circumscribing data sharing that other collaborators weren’t previously aware of. The collaborator(s) responsible for data management may not have planned ahead for sharing, necessitating extra effort to curate data and incurring costs the study didn’t budget for. Unanticipated obstacles can arise even when researchers share data within a collaboration, such as when a researcher at one organization transfers data to a collaborator at another organization.

What to do:

Build consensus about data sharing goals and roles across stakeholders early. Document conversations and their outcomes.
Refer to this documentation when establishing governing documentation [Memoranda of Understanding (MOUs), collaboration agreements, IRB protocols] and contracts to ensure alignment.
Define data rights and data decision-making roles. Data rights include ownership, access, and usage rights. Determine who decides if and how data will be shared, and who implements decisions. Document roles with a RACI matrix or similar tool.

Planning for sensitive data sharing, privacy, and confidentiality

HEAL studies often involve sensitive data (e.g., identifiable human subjects data and proprietary intellectual property) subject to legal, contractual, and/or regulatory requirements. Understanding your organization’s compliance requirements is essential when setting data sharing goals, selecting resources, and designing data workflows.

Lessons learned: Responsible data sharing doesn’t always mean open access. Sensitive HEAL data can be shared safely using techniques like de-identification, anonymization, date shifting, dimensionality reduction, introducing statistical noise, and controlled data access. Sharing data too permissively carries risk, but failing to maximize appropriate data sharing may violate HEAL Initiative compliance expectations.

What to do:

Talk to organizational privacy and compliance offices to ensure awareness of applicable requirements and policies. Understand how these impact data sharing.
Draft documents that impact data sharing to align with any sensitive data handling requirements. IRB application materials, Data Management and Sharing Plans (DMSPs), contracts, Data Security Plans (DSPs), and other study governing documents should accurately reflect relevant requirements.
Build in privacy protections to data management and sharing workflows. Technology resources and data handling processes should maintain required privacy protections throughout the entire data pipeline.

Additional resources:

Planning for ethical data sharing

Studies can support ethical data by using frameworks that balance risks and benefits for all stakeholders, including researchers, study subjects, funders, and affected communities. This is especially important in HEAL studies involving groups like substance use disorder patients or American Indian & Alaska Native peoples (AI-AN) communities.

Lessons learned: Ethical data practices support scientific validity, build trust, and support long-term collaboration. Overlooking data ethics can lead to data quality issues, strained relationships, and lost opportunities for future collaborations.

What to do:

Use appropriate data ethics principles, such as the CARE Principles for Indigenous Data Governance.
Adopt a framework like the Data Ethics Canvas by the Open Data Institute or US Federal Data Strategy Data Ethics Framework to guide ethical decision-making.

Understanding data licensing and contracts

Research data is often subject to various contracts. Some contracts focus on data directly, like Data Transfer Agreements (DTAs), Data Use Agreements (DUAs), Data Sharing Agreements (DSAs), data licenses, or Terms of Use in data repositories. Other contracts that govern the research project more generally, such as Research Agreements or Confidentiality Agreements, may also have implications for data sharing.

Lessons learned: How contracts impact data sharing may not always be obvious, especially when contracts are signed early in the study lifecycle, long before data are actually shared. Using boilerplate language or templates uncritically, or signing contracts without fully understanding their implications, can result in contracts that do not accurately reflect the project’s data sharing requirements, goals and roles. It’s important to understand what contracts mean for data rights and data sharing before signing them to prevent unanticipated obstacles later in the study**.**

What to do:

When working with institutional legal or contracts offices, ensure they understand study data sharing requirements and goals and draft or review contracts with those goals in mind.

Choosing technology for data management

Research data is often handled through numerous hardware and software components. Each component may have its own data privacy, security, compliance, usability, and accessibility considerations. Moreover, these components may be spread across multiple institutions with differing policies governing data and technology.

Lessons learned: Technology choice must balance two critical factors: the study’s requirements, and organizational requirements related to privacy, security, and compliance. IT departments often recommend pre-vetted technologies meeting organizational requirements, but these options don’t work for every study. Technologies which meet all study requirements (e.g., storing very large data, collecting data electronically offline, or enabling access for collaborators at other organizations) may not have been vetted by the organization, leading to privacy/security risks or consequences for not complying with organizational policy. Working with institutional IT departments to choose technology produces the best outcomes; however, it’s important study teams clearly communicate their requirements so IT colleagues can make appropriate recommendations.

What to do:

Consult your IT department to identify appropriate technology resources for handling study data. Consider both hardware and software components.
Choose secure and compliant file storage for the study data type(s). Your organization may only allow the use of certain data collection or file storage solutions for human subjects data or sensitive data. Consider long-term data storage requirements, which often extend beyond study’s end.
Test the full technology infrastructure end-to-end to identify any issues before using it to handle data. This is especially important for multi-institution research collaborations. For example, external collaborators may not be allowed to access all technologies hosted by your organization, or an organization may have policies governing data removal from certain technology resources.
Document the project’s technical infrastructure. Documentation may include information about hardware and software components (including versioning and update schedule), a data flow diagram, and data backup or disaster recovery plans. Some organizations require studies to have a Data Security Plan (example).

Additional resource:

Data Security Plan Development Guide for Researchers

Generate / Collect

Organizing data for re-use

Consistent data organization enables efficiency and collaboration for the study team and any future data re-users.

Lessons learned: Inconsistent file organization can lead to outdated or incorrect file use, hinder collaboration, and reduce reproducibility. Clear, consistent file and folder naming conventions, folder structures (e.g., organized by grant aim or data type), versioning, co-located metadata, and documentation explaining how the data was collected, processed, and structured make data easier to manage and use. Some data repositories require specific deposition structures; adopting these early will speed data sharing later.

What to do:

Adopt file/folder naming and organization conventions. See MIT Libraries file organization guidance, and/or follow repository guidance.
Define minimum documentation needs for each data asset, including data dictionaries (aka codebooks); data processing code, scripts, markdown files, and programs; README files; and blank data collection instruments.
Store documentation with the data it describes. Include README files, data dictionaries, and version histories to support interpretation.
Implement version control to prevent accidental data loss and improve traceability.

Using data standards and CDEs

Data standards, including Common Data Elements (CDEs), improve interoperability and enable integration with similar datasets. The NIH HEAL Initiative requires pain clinical studies to use HEAL core CDEs and encourages other studies to use HEAL CDEs or more broad NIH CDEs when possible.

Lessons Learned: Using required CDEs from the start is important. Some HEAL repositories mandate data conformity as a deposition requirement (e.g., NIMH Data Archive). HEAL also requires pain clinical studies to use certain core Common Data Elements (CDEs) for human subjects data. Failing to deploy required CDEs when collecting data creates additional data preparation needs before sharing and may lead to conversion difficulties resulting in data loss. Even small changes to CDEs can result in non-compliance, so studies should ideally deploy data standards exactly as specified.

What to do:

Identify your study’s CDE / data standard requirements, being mindful of pain study core HEAL CDEs and any additional HEAL-compliant repository data standards / CDE requirements.
When using Common Data Elements (CDEs), implement them exactly as specified in the data collection instrument(s), rather than planning to transform data after collection.
If CDE deviation is necessary (for example, translating into a new language), document deviations and their justifications and discuss them with your NIH PO.
Conform data to data standards early. Do not wait until you are sharing data.

Additional resources:

HEAL core CDEs (required for pain clinical studies)
HEAL CDE Repository
NIH Common Data Elements repository
Variable Standards Finder (HEAL Stewards)

Differentiating between raw and derived data

Data very often undergoes processing between when it’s collected and when it’s analyzed. Some common types of processes are cleaning, merging, de-identifying, annotating, or otherwise transforming data. The output data from these processes may be referred to as “clean” or “derived” or “analytic” data (vocabulary differs across disciplines). Processed, or derived, data should be reproducible, and traceable back to its raw inputs through code or other documentation.

Lessons Learned: Failure to adequately document and describe data processing hampers reproducibility. Studies that rely heavily on manual processes—like point-and-click operations in Excel—create data outputs that are difficult to reproduce without extensive documentation. This limits transparency, increases the risk of human error, and makes both data quality control and reproducibility challenging. By contrast, code-based processing creates a repeatable chain of transformations, enhancing both credibility and reproducibility.

What to do:

Avoid point-and-click and manual data manipulation processes as much as possible**.** Instead, use computational approaches (e.g. code / scripts) to perform data processing.
Share code / scripts as part of data sharing. Some data repositories accept code files as part of data deposits. If your study’s repository doesn’t accept code, use a suitable HEAL-compliant repository like GitHub to share code.
Store code with the dataset in a repository or include a link to an external repository where code is shared (e.g., GitHub).
Provide comments or separate documentation to explain processing steps in further detail, when appropriate.

Sharing acquired data

Sharing data becomes more complex when all or part of the dataset was acquired from external sources. Researchers may assume that once data is in their possession, it can be freely shared—but acquired data typically comes with restrictions. These limitations are often specified in Data Use Agreements (DUAs), license terms, repository conditions, or other contracts that govern the data's original release. Studies that incorporate previously collected data must factor in these limitations when sharing data.

Lessons Learned: If secondary and primary data are combined without carefully addressing these constraints, it can lead to noncompliance with legal or ethical requirements and may prevent the study from sharing its data publicly. Studies must carefully assess which portions of acquired data can be shared (if any), transform data files to omit or redact data points that cannot be shared, and reflect any restrictions in their Data Management and Sharing Plan (DMSP).

What to do:

When a study combines data from multiple sources into an aggregated dataset, and some of that data is acquired data (not generated by the study), it may not be appropriate to share the aggregated dataset. Refer to any restrictions placed on sharing the secondary data. In some cases, the study may be able to share data elements it generated, but not data elements from secondary data assets.
Clearly distinguish between generated and acquired data in documentation and DMSPs.
Note any sharing limitations in the DMSP.

Additional resource:

HEAL Sensitive Data

Choosing open source vs proprietary software

The software used to collect and analyze data may affect how accessible and reusable data are. Using proprietary software introduces dependencies on specific platforms, licenses, or file formats that may not be accessible to all users.

Lessons Learned: Data, code, or documentation saved in proprietary formats can become inaccessible to researchers who lack the necessary tools. This creates financial, geographic, or institutional barriers to reuse, limiting the reach of shared data and producing unequal opportunities to use data. In contrast, open-source tools and formats promote transparency, reduce barriers to access, and support broader reuse. Choosing open-source options whenever feasible helps ensure that research outputs remain usable and accessible over time.

What to do:

Choose open-source software over proprietary software for data workflows whenever possible. This includes tools used for data collection, management, and analysis.
If proprietary tools are used, share copies of outputs like code and data files in accessible formats (e.g., export data to CSV, save code in a TXT).

Additional resource:

Best practices for sharing research software

Process

Tracking data provenance

A single data asset may be used in multiple ways (towards journal articles, posters, or conference presentations) and often undergoes several transformations. Tracking data provenance (origin, version history, and chain of transformations) ensures data transformations and outputs are traceable and verifiable.

Lessons learned: Without clear provenance it becomes difficult to determine which data version supports a specific finding. This limits reproducibility and may lead to misinterpretation or unnecessary raw data re-processing. Maintaining strong provenance builds trust, ensures transparency, and protects against misuse.

What to do:

Establish version control strategies to track changes to data, code, and other applicable research processes. Some data repositories automatically version-control data deposits, including archiving older versions to support reproducibility.
Document provenance, maintaining records of the origin, transformations, and handling of datasets.

Additional resource:

HEAL Data Sustainability How To Guide

Cleaning data

Data cleaning enhances data quality and integrity, ensuring findings are valid and datasets are suitable for reuse.

Lessons learned: Inconsistent or incomplete cleaning can compromise analysis and results, inhibit reproducibility, and distort downstream analyses and insights..

What to do:

Automate cleaning with code, removing or correcting erroneous or duplicative data points to ensure datasets are valid, accurate, and fit for purpose.
Adopt quality control (QC) processes, including regular quality reviews or automated QC checks, to report on data quality measures during data collection and processing.
Document calculations, aggregations, derivations, and missing data, using code / scripts, SOPs, etc., to provide transparency into the data provenance and allow secondary users to assess dataset validity.
Transform, normalize, and aggregate data to ensure structural and semantic consistency. For example, convert categorical values into standard formats or align date/time formats across sources.

Additional resources:

De-identifying and anonymizing data

Protecting sensitive information is essential for ethical research and compliance with regulations like HIPAA. De-identification and anonymization transform or eliminate direct (e.g., names, email addresses) and indirect (e.g. geographic information) identifiers, but anonymization may require more specialized skills, such as statistical disclosure control.

Lessons learned: Unclear, inconsistent, or late de-identification or anonymization can lead to privacy breaches, legal and ethical noncompliance, rework, or unsharable data. On the other hand, excessive anonymization may strip away important context, reducing data’s reusability.

What to do:

Determine when de-identification should occur by considering factors such as data access permissions, storage environments, and the audience. Aim for early de-identification to reduce deductive disclosure risk.
Tailor de-identification to the data type. For example, redact or pseudonymize direct and indirect identifiers in qualitative data.
Apply techniques proportional to privacy requirements. For HEAL-funded research, de-identification and access controls are often sufficient for data privacy requirements.
Document de-identification techniques and tools. Describe any review or validation process performed and the results of the review.
Use a human validator to validate de-identification results. Institutions and data repositories may provide honest brokers or data curators for support.

Additional resources:

Generating metadata

Metadata is essential for making research data findable, interpretable, and reusable. In the HEAL Data Ecosystem study-level and variable-level metadata (SLMD and VLMD) support HEAL Data Platform findability and enable HEAL Semantic Search interoperability. Tabular data VLMD is often stored in a data dictionary or codebook. Protocols, standard operating procedures (SOPs), and README files typically contain VLMD. Non-tabular VLMD may be stored in files such as interview scripts or coding schemas (qualitative data), instance files or annotations (imaging), or sequence files (genomics data).

Lessons learned: Insufficient metadata makes data hard to find, interpret, integrate, or reuse, even by the original study, over time. Missing metadata reduces visibility and increases the risk of misinterpretation or misuse.

What to do:

Complete the HEAL SLMD form to share study-level metadata.
Submit VLMD, ideally in a data dictionary/codebook or appropriate non-tabular format.
If needed, consult a data curator from your organization’s libraries or from a data repository.

Additional resources:

HEAL Metadata Resources
Generate a HEAL-compliant Data Dictionary (HEAL Platform)
OSF’s How to create a data dictionary
The Data Curation Network’s Data Curation Primers

Analyze

Documenting data workflows

Well-documented workflows and consistent organization ensure transparency, continuity, and reproducibility. Undocumented workflows lead to confusion, errors, and hindered reproducibility.

Lessons learned: Inconsistent file structures, missing metadata, and undocumented processing increase the risk of errors and make it harder to track data provenance. Well-documented workflows enable validation, replication, and continuity, which supports compliance with the NIH DMS Policy’s emphasis on maximizing data sharing “of sufficient quality to validate and replicate the research findings.”

What to do:

Document each step of data processing and analysis through SOPs, annotated code and clear file naming, so they can be understood, replicated, and validated.
**Use version-controlled platforms (**GitHub, GitLab, or institutional equivalents) to track code, store it, and maintain reproducibility. Many data repositories version control deposited files.
Leverage software tools that capture and support workflow transparency, such as Laboratory Information Management Systems (LIMSs), electronic lab notebooks, workflow management tools (like Jupyter or R Markdown), and version-controlled repositories.
Refer to data type-specific guidance in resources like curation primers, scholarly articles, or data type-specific repository documentation.

Additional resources:

NIH Pragmatic Trials Collaboratory Living Notebook
Data Curation Network Data Curation Primers
Turing Way Guide to Reproducible Research

Describing key assumptions and analytical decisions

Transparent interpretation of research findings depends on clearly stating the assumptions, decisions, and frameworks that shaped the analysis.

Lessons learned: Unrecorded choices about how to handle missing data, which variables to include, and model selection can lead to data misinterpretation, misuse, or reproducibility concerns. Secondary users risk drawing invalid conclusions without this context.

What to do:

Document model selection criteria and explain the rationale for using a particular model (e.g., linear regression, mixed effects, logistic).
Specify inclusion/exclusion criteria (e.g., age thresholds, incomplete records), how they were applied, and any exceptions.
Explain treatment of missing data, whether they were excluded, imputed, or modeled, and how this may affect interpretability or generalizability.
Define analytic timepoints, measurement intervals, and units of analysis. For example, "30-day follow-up visits were used to define primary outcome."
Provide README files, inline code comments, or analysis notes/memos to capture assumptions for secondary users and reviewers.

Additional resource:

Software supporting code annotation: R Markdown, Jupyter Notebook

Version controlling research outputs

Use version control to connect datasets, scripts, and outputs to specific analyses and published findings. Clearly labeled, well-structured outputs (e.g. tables, charts, graphs, and other visualizations) support reproducibility and interpretability across publications, presentations, posters, and other mediums. The same data may apply to multiple similar outputs; for example, analytic results may appear on a poster several months before a different version is published in a journal.

Lessons learned: HEAL studies must share both data and research outputs appropriately. Outputs that lack context or links to underlying data and code limit reproducibility. Overwriting code or modifying data without documenting changes also impedes reproducibility and validation, undermining trust in the findings. Because the same data may generate multiple outputs, the study team and secondary users must be able to distinguish which versions of data and code files produced an output. For example, if a dataset is cited in a journal article, the version in the repository should match the one used in the publication.

What to do:

Use descriptive, consistent filenames and folder structures aligned with project phases or outputs**.** See “Organizing data…” under the Generate/Collect phase.
Employ version control (like GitHub), committing changes with descriptive messages and using releases to differentiate versions. Or, keep a changelog describing changes, like added variables, updated exclusion criteria, or software updates.
Include contextual metadata or a README with each output, noting the inputs, code / script used, file(s) containing associated VLMD, and other notes.
Link published outputs to underlying data using persistent identifiers, such as DOIs.

Choosing a data repository

Storing data in a trusted research repository supports reproducibility, access, and reuse. HEAL studies are expected to share data through one or more of the 29 HEAL-compliant data repositories, evaluated for HEAL-funded data. Note: the HEAL Data Platform is a catalog that links to HEAL data in compliant repositories; it is not a data repository.

Lessons learned: Repositories provide a range of services to researchers during submission; review protocols and available curation support before submitting data. Use different repositories for different types of study data (e.g. one repository for sequence or imaging data, and another for code/scripts) if needed. Include VLMD with your data. Your organization may have membership benefits with some data repositories.

What to do:

Identify specific data sharing needs, including data size, formats, and appropriate access controls or data curation services.
Identify repository options through the HEAL Repository Selection Tool.
Contact the HEAL Stewards for a study-specific repository recommendation.

Additional resources:

Curating shared data

According to the NNLM, “Data curation is composed of research data management and digital preservation and involves processes such as adding metadata to make data more findable and understandable, ingesting data into a [data] repository, … validating file checksums and file fixity checks, and other tasks for organizing, cleaning, describing, enhancing, storing, and preserving data.” Data curation transforms data used by the study team into forms appropriate for external sharing and long-term preservation.

Lessons learned: Under the NIH DMS Policy and HEAL Public Access and Data Sharing policy, HEAL studies must appropriately share scientific data underlying research findings, but not all raw data needs to be shared. Curation often involves HEAL studies transforming data (e.g. de-identification) and generating metadata. Starting early reduces the workload later.

What to do:

Determine what data to share, including data supporting publications, high-value data, or data required by the NIH HEAL Initiative.
Prepare data, determining if any data should be held back for privacy/security reasons, de-identifying/anonymizing data, generating or enhancing SLMD and VLMD, and reformatting data into open formats (e.g., CSV, PDF, PNG).
Choose an appropriate license for your shared data. Creative Commons licenses commonly define terms of use for published datasets.
Consult a data curator at your organization or data repository. Some will curate or de-identify data at cost, while others offer guidance.

Additional resources:

Managing data access & reuse

Open access allows anyone to freely access and reuse shared data (aka public access). Controlled and restricted access limit findability, accessibility, and reusability. Available access controls vary across data repositories. Data licenses or contracts can define allowable uses of shared data.

Lessons learned: HEAL studies may use open, controlled, or restricted access approaches. For sensitive data, consider repositories with secure platform protections and controlled access options.

What to do:

Decide on the best access option for your shared data (open, controlled, or restricted).
Choose a data repository with adequate access controls, such as access request mechanisms, enclaves (virtual/physical), temporary access/”visiting”, security controls, or others.
Define restrictions and establish licenses (Creative Commons for open access; Terms of service, Terms of use, DUAs or DSAs for controlled access), and define requirements around IP, citations, and allowable uses.

Additional resources:

The Open Data Handbook
QDR Access Controls (example)
Aalto University’s Practical Steps for Publishing Research Data

Linking data to publications

Publishing research data with journal articles supports replication and re-use and is often required by the publisher. It can also boost citation counts and enhance research impact. Persistent identifiers (PIDs) allow research artifacts to be linked and referenced across different locations, promoting findability.

Lessons learned: HEAL policy expects that “Underlying Primary Data for the Publications will be made broadly available through an appropriate data repository.” “Available upon request” statements are not HEAL compliant.

What to do:

Assign persistent identifiers, such as DOIs (Digital Object Identifiers), to datasets, publications, code, and other research outputs to ensure long-term findability and reference.
Include a publication data availability statement that points to the repository data deposit. For example: "The data that support the findings of this study are available from [repository name] with the following DOI: [DOI].”
Include associated publication DOIs with repository metadata.
Use data embargos if needed, before the publication is released. An embargo delays data visibility to others. Be sure to lift the embargo at the end of the performance period for HEAL compliance.

Promoting and measuring your data’s re-use

Sharing research data fosters collaboration and knowledge-building, cross-disciplinary discoveries, and public health progress. It enhances researcher visibility, supports career advancement, and increases publication/citation opportunities.

Lessons learned: Repositories often track dataset reuse metrics (views, downloads, citations), which may support tenure and other promotion considerations. Studies show that publications linked to repository-hosted data are up to 25% more likely to be cited.

What to do:

Link published data to your ORCID or researcher profile and include your ORCID in the repository deposit metadata.
Clearly specify copyright and/or citation requirements to ensure correct attribution upon re-use.
Measure your research impact, leveraging persistent identifiers and dataset reuse metrics to track data access and use.

Additional resources:

10 Reasons to Share Your Data (Nature)
Why share your research data? (Taylor & Francis)

Data Sharing Throughout the Research Lifecycle

Overview

Pre-Award

Writing a NIH data management and sharing plan (DMSP)

Budgeting for data management and sharing

Identifying organizational resources for data management and sharing

Start Up

Identifying data sharing goals and roles

Planning for sensitive data sharing, privacy, and confidentiality

Planning for ethical data sharing

Understanding data licensing and contracts

Choosing technology for data management

Generate / Collect

Organizing data for re-use

Using data standards and CDEs

Differentiating between raw and derived data

Sharing acquired data

Choosing open source vs proprietary software

Process

Tracking data provenance

Cleaning data

De-identifying and anonymizing data

Generating metadata

Analyze

Documenting data workflows

Describing key assumptions and analytical decisions

Version controlling research outputs

Share

Choosing a data repository

Curating shared data

Managing data access & reuse

Linking data to publications

Promoting and measuring your data’s re-use