Wellcome Trust and the Chan Zuckerberg Initiative Partner with DataCite to Build the Open Global Data Citation Corpus


Aggregated references to data across outputs will help the community monitor impact, inform future funding, and improve the dissemination of research

Amsterdam – 17 January 2023 – DataCite is pleased to announce that The Wellcome Trust has awarded funds to build the Open Global Data Citation Corpus to dramatically transform the data citation landscape. The corpus will store asserted data citations from a diverse set of sources and can be used by any community stakeholder.

The Make Data Count (MDC) initiative was established in 2014 to develop an infrastructure for open data metrics. A key learning from the initiative is that the community needs a clear understanding of data reuse to monitor impact, inform future funding, and improve the dissemination of research. The development of a trusted central aggregate of all references to research data across articles, preprints, government documents, and other outputs will help achieve this goal.

“Surfacing the reach and impact of published data is key to efforts to improve research culture and implement open research practices, as well as evidencing the true impact of research funding,” says Hannah Hope, Open Research Lead, Wellcome. “We are excited to fund DataCite’s project to produce an open data citation corpus from disconnected systems to address this evidence gap. Our shared ambition is that this corpus extends globally across all research domains, not just the biomedical sciences, we encourage potential data providers and users to engage with DataCite to enable this.”

“We are delighted to have relevant open datasets and algorithms generated by the Chan Zuckerberg Initiative included into this effort to build an open linked corpus for all data citations. DataCite is creating a unique opportunity for the research community to expand access to and understanding of the impact of datasets in an open linked scholarly output ecosystem,” says Patricia Brennan, Vice President of Product Management for Science Technology.

Matt Buys, Executive Director of DataCite, reflects on the value of diversifying the sources of data citations: “DataCite is ecstatic to be leading this project and the Make Data Count Initiative to further the global community efforts in creating meaningful data metrics. The Data Citation Corpus has the potential to dramatically alter the data citation landscape and we look forward to working closely with the community as we scale our efforts.”

This open CC0 corpus of data citations extends beyond DataCite and Crossref metadata and includes citations to data identified by DOIs as well as by non-DOI identifiers (e.g., accession IDs). Many datasets are mentioned only in unstructured form in research articles. The CZI research team has developed a machine-learning algorithm to extract these data citations from the full text of journal articles and preprints. The team makes its data and code available to the community for reuse, in support of its goal of strengthening the open scholarly infrastructure ecosystem and linked research outputs.

The seed file will include accession numbers from the Europe PMC corpus. According to Jo McEntyre, Associate Director, EMBL-EBI, “This initiative will bring together all identifiers for data, including DOIs and the life science Accession numbers mined from research articles in Europe PMC, into a broad, valuable community corpus. It will enable further understanding of data usage patterns and co-dependencies across disciplines in a way we have not been able to do before, as well as highlight the need to properly cite the use and reuse of data in publications.”

The project will launch on 1 February 2023. Contributions from the community will be integral to scaling the Open Global Data Citation Corpus. Interested community stakeholders are invited to join the virtual kick-off and participate in a conversation between DataCite, Wellcome Trust, Chan Zuckerberg Initiative, EMBL-EBI, COKI, OpenAIRE, and OpenCitations. You can watch the recording here (individual chapters and slides can be found in the video description).

About Wellcome Trust

Wellcome supports science to solve the urgent health challenges facing everyone. We support discovery research into life, health and wellbeing, and we’re taking on three worldwide health challenges: mental health, infectious disease and climate and health.

About Chan Zuckerberg Initiative

The Chan Zuckerberg Initiative was founded in 2015 to help solve some of society’s toughest challenges — from eradicating disease and improving education, to addressing the needs of our communities. Through collaboration, providing resources and building technology, our mission is to help build a more inclusive, just and healthy future for everyone. 

About DataCite

DataCite is a leading global non-profit membership organisation that provides persistent identifiers (DOIs) for research data and other research outputs. DataCite is an active participant in the research community and promotes data sharing and citation through community-building efforts and outreach activities.