Harmonizing Metadata Across Disciplines – Bioschemas and the DataCite Metadata Schema

https://doi.org/10.5438/vzqp-m504

Standards & More Standards

Metadata standards are fundamental to the semantic web and database management; they form the basis for data discovery, sharing, and organization across a range of domains. These standards make semantic interoperability possible, which is a requirement for integrating various data sources, streamlining data retrieval procedures, and developing knowledge management applications [1]. However, the growing number of both general and discipline-specific metadata standards brings challenges. Dublin Core and other general standards apply broadly across many disciplines, but they might lack the detail required for certain scientific fields. Conversely, discipline-specific standards provide thorough data descriptions for certain research fields but may hinder cross-domain data interoperability. Schema.org and Bioschemas.org are two initiatives that try to close this gap by making sure that important information can be found through both general search engines and specialized databases, especially in the life sciences.

Scientists are now using Schema.org, which was first created for general applications like e-commerce, to enhance the searchability of web-based data in particular domains. To modify and possibly incorporate these domain-specific vocabularies into Schema.org, community efforts are needed.

By extending Schema.org, Bioschemas aims to improve findability and data interoperability in the life sciences, particularly to meet the unique needs of targeted fields. Bioschemas addresses a fundamental challenge in the life sciences by adopting search engine-friendly structured data standards: making sure that valuable data are readily available not only through specialized scientific databases, but also through general search engines that are frequently used by researchers across various disciplines. It offers a detailed perspective on Schema.org usage, providing best practice documentation, including guidance on property usage, definitions, recommendations (whether a property is recommended, required, or optional), and cardinality (the number of times a property can appear), accompanied by example code. Bioschemas encourages the community to participate in extending Schema.org to cover specific life sciences domains and subdomains, and is guided by a steering committee that approves proposals based on use cases and work plans. Successful working groups demonstrate community adoption and example implementations.
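To make the ideas of marginality (required, recommended, optional) and cardinality more concrete, here is a minimal Python sketch of how a record could be checked against a profile. The profile below is hypothetical: the property names, their levels, and the helper `check_record` are illustrative assumptions and are not taken from any published Bioschemas profile.

```python
import json

# Hypothetical profile: property -> (marginality, cardinality).
# These entries are illustrative only, not a real Bioschemas profile.
PROFILE = {
    "name": ("required", "ONE"),
    "identifier": ("required", "MANY"),
    "description": ("recommended", "ONE"),
    "keywords": ("optional", "MANY"),
}


def check_record(record, profile=PROFILE):
    """Return a list of human-readable problems found in a JSON-LD-like record."""
    problems = []
    for prop, (marginality, cardinality) in profile.items():
        value = record.get(prop)
        if value is None:
            # Optional properties may be absent without any report.
            if marginality in ("required", "recommended"):
                problems.append(f"missing {marginality} property: {prop}")
            continue
        # Cardinality ONE means the property may appear only once.
        if cardinality == "ONE" and isinstance(value, list):
            problems.append(f"property {prop} must have a single value")
    return problems


# A small Schema.org-style record; the values are placeholders.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example sample collection",
    "identifier": ["https://example.org/id/123"],
    "keywords": ["biosample", "genomics"],
}

print(json.dumps(record, indent=2))
print(check_record(record))
```

In this sketch the record passes the required checks, and the only reported problem is the missing recommended `description`. Real profiles carry more guidance than a check like this can capture, so it should be read as an illustration of the concepts rather than a validator.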

The proliferation of metadata standards across disciplines introduces significant interoperability challenges and undermines their intended purpose [2]. The integration of data from various sources is made more difficult by semantic heterogeneity, where different standards describe the same idea in different ways [3].

How can this be resolved? 

Using mapping and crosswalking techniques is one workable way to deal with the interoperability issues caused by the proliferation of metadata standards. These techniques attempt to establish correspondences or “maps” between different metadata standards, enabling the translation or “crosswalk” of metadata from one standard to another [4]. This process facilitates the integration of data from diverse sources by ensuring that metadata encoded according to different standards can be understood and used across systems.

Mapping involves the identification of equivalent, similar, or related concepts among different metadata standards. It requires a deep understanding of the semantics of the elements within each standard, as well as the context in which they are used. Once a mapping is established, crosswalks are developed to specify how elements in one metadata standard correspond to elements in another, including rules for transforming data values to fit the target standard’s expectations. Consider “mapping” as the intellectual activity of comparing and analyzing two or more metadata schemas, and a “crosswalk” as the visual and textual product of the mapping process [5], [6].
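As a rough illustration of a crosswalk, the Python sketch below pairs each source element with a target element and a rule for transforming its value. The element names loosely resemble Dublin Core on the source side and Schema.org on the target side, but the table itself, the transformation rules, and the helper `apply_crosswalk` are assumptions made for this example, not an endorsed mapping.

```python
# Hypothetical crosswalk: source element -> (target element, value transform).
# The correspondences and rules are illustrative assumptions only.
CROSSWALK = {
    "title": ("name", lambda v: v.strip()),
    # Split a semicolon-separated string into a list of names.
    "creator": ("author", lambda v: [n.strip() for n in v.split(";")]),
    "date": ("datePublished", lambda v: v),
}


def apply_crosswalk(source, crosswalk=CROSSWALK):
    """Translate a source record; also report elements with no mapping."""
    target, unmapped = {}, []
    for element, value in source.items():
        if element in crosswalk:
            target_element, transform = crosswalk[element]
            target[target_element] = transform(value)
        else:
            # Keep track of what the crosswalk cannot express.
            unmapped.append(element)
    return target, unmapped


source_record = {
    "title": "  Soil core A  ",
    "creator": "Doe, Jane; Roe, Richard",
    "date": "2024-03-05",
    "rights": "CC-BY",
}

target_record, unmapped = apply_crosswalk(source_record)
print(target_record)
print(unmapped)
```

Returning the unmapped elements alongside the translated record is a deliberate choice in this sketch: it makes visible where the two standards do not line up, which is where the intellectual work of mapping is still needed.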

DataCite efforts

DataCite aims to connect discipline-generic and discipline-specific schemas, which is significant for enhancing the findability, accessibility, interoperability, and reusability (FAIR) of research data [7]. We seek to enable a more unified and effective approach to data management and sharing across diverse scientific domains by helping to tackle the challenges posed by metadata schema proliferation and the requirement for interoperability amongst disparate data standards. One example is the domain-specific metadata template for Cognitive Neuroscience research, developed as part of the Implementing FAIR Workflows project on the CEDAR platform and connected to neuroscience research data. Furthermore, as part of our commitment to the FAIR Principles and aligned with our joint work in the FAIR-IMPACT project (both DataCite and the University of Manchester are part of the project consortium), we are also working together to design a framework to create, document, and share semantic artifact crosswalks and mappings. Another prime example is the crosswalk developed between the Registration and Descriptive metadata schemata of the International Generic Sample Number (IGSN ID) and the DataCite Metadata Schema through our partnership with IGSN e.V. In a similar way, as part of this effort, we have developed a crosswalk between the proposed Biosamples Type and the current DataCite Metadata Schema 4.5 [8].

Combined community efforts

Through discussions with important national and European “sample” resources, we collaborated with institutions such as the EMBL-EBI European Nucleotide Archive, Bioschemas.org, and representatives from Biodiversity Genomics Europe (BGE) and the German biodiversity community (Leibniz Institute for the Analysis of Biodiversity Change, Museum Koenig Bonn) to examine how the life science community describes biosamples or biospecimens (we use these terms interchangeably). The initiative addresses limitations in the existing Sample Profile by proposing a BioSample DRAFT Profile. Although the Biosamples Type is not yet officially recognized by Schema.org, steps are being taken to integrate it, which requires community consensus and validated use cases. Workshops at events including the ELIXIR Biohackathon and the 2nd Biohackathon Germany have been instrumental in identifying essential metadata elements for sample description.

Our goal is to put forward recommendations to support alignment between DataCite metadata and the Bioschemas Biosample Type, ultimately leading to improved FAIRness of biosample data. This effort focuses on making biosample data easier to find through search engines and specialized databases, enhancing metadata with detailed information, employing persistent identifiers (PIDs), ensuring uniform access through standard protocols, achieving interoperability with standardized data models, linking semantically to related datasets, facilitating discovery across disciplines, and encouraging reuse with comprehensive metadata [9].

These samples can be physical samples, as well as digital representations such as genomic sequences or digital twins of biosamples, which represent detailed digital versions of physical samples and offer benefits like extended preservation, detailed annotation, and interoperability [10]. PIDs are crucial for linking digital twins with their physical counterparts, ensuring unique reference, persistent access, and integration across databases and studies. Furthermore, metadata standards ensure consistent, detailed, and standardized descriptions of biosamples and their digital twins [11].
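One way to picture how a PID can link a digital twin to its physical counterpart is through a related identifier in the twin's metadata record. The Python sketch below uses the `relatedIdentifiers` structure found in DataCite metadata; the identifiers are placeholders, the helper functions are hypothetical, and the choice of `IsDerivedFrom` as the relation type is an assumption made for illustration, not a recommendation from this project.

```python
def link_twin_to_sample(twin_record, sample_pid, relation_type="IsDerivedFrom"):
    """Return a copy of twin_record with a related identifier for the physical sample."""
    linked = dict(twin_record)
    related = list(twin_record.get("relatedIdentifiers", []))
    related.append({
        "relatedIdentifier": sample_pid,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,  # assumed relation, see lead-in
    })
    linked["relatedIdentifiers"] = related
    return linked


def physical_counterparts(record, relation_type="IsDerivedFrom"):
    """List the PIDs this record points to with the given relation type."""
    return [
        r["relatedIdentifier"]
        for r in record.get("relatedIdentifiers", [])
        if r["relationType"] == relation_type
    ]


# Placeholder identifiers; neither is a registered DOI.
twin = {"doi": "10.1234/example-twin", "titles": [{"title": "Digital twin of sample X"}]}
linked_twin = link_twin_to_sample(twin, "10.1234/example-sample")

print(physical_counterparts(linked_twin))
```

Because the link is expressed with identifiers rather than embedded descriptions, each record can be maintained independently while the connection between them stays resolvable.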

What’s next?

The outcome of these community efforts will be a community consultation spearheaded by the Bioschemas community, which will open the door for community endorsement and adoption of real-world use cases. Hopefully, this will result in the Biosamples Type being integrated into Schema.org. It can also provide the foundation for a recommendation to the life sciences community that supports translating biosample metadata into the DataCite Metadata Schema when registering IGSN IDs through DataCite services.

Please keep an eye out for the project’s future developments!


References

  1. R. Lubas, A. Jackson, and I. Schneider, The Metadata Manual: A Practical Workbook. Elsevier, 2013.
  2. L. M. Chan and M. L. Zeng, “Metadata Interoperability and Standardization – A Study of Methodology Part I,” D-Lib Magazine, vol. 12, no. 6, Jun. 2006, doi: 10.1045/june2006-chan.
  3. A. Doan, A. Halevy, and Z. Ives, “Schema Matching and Mapping,” in Principles of Data Integration, Elsevier, 2012, pp. 121–160. Accessed: Mar. 05, 2024. [Online]. Available: https://doi.org/10.1016/b978-0-12-416044-6.00005-3
  4. I. Xie and K. K. Matusiak, “Metadata,” in Discover Digital Libraries, Elsevier, 2016, pp. 129–170. Accessed: Mar. 05, 2024. [Online]. Available: https://doi.org/10.1016/b978-0-12-417112-1.00005-3
  5. A. B. Zhang and D. Gourley, “Creating metadata,” in Creating Digital Collections, Elsevier, 2009, pp. 73–88. Accessed: Mar. 05, 2024. [Online]. Available: https://doi.org/10.1016/b978-1-84334-396-7.50006-7
  6. “Introduction to Metadata: Metadata Matters: Connecting People and Information.” https://www.getty.edu/publications/intrometadata/metadata-matters/
  7. M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Scientific Data, vol. 3, no. 1, pp. 1–9, Mar. 2016, doi: 10.1038/sdata.2016.18.
  8. K. Stathis, C. Ross, B. Dreyer, and P. Vierkant, “DataCite Metadata Schema 4.4 to Schema.org Mapping,” Zenodo, Dec. 20, 2022. https://doi.org/10.5281/zenodo.7661399
  9. S. El-Gebali, R. Macneil, R. Edmunds, P. Tewatia, and J. Klump, “Biospecimens in FDO world,” Research Ideas and Outcomes, vol. 8, Oct. 2022, doi: 10.3897/rio.8.e94544.
  10. E. Schultes et al., “FAIR Digital Twins for Data-Intensive Research,” Frontiers in Big Data, vol. 5, May 2022, doi: 10.3389/fdata.2022.883341.
  11. D. Peters and S. Schindler, “FAIR for digital twins,” CEAS Space Journal, pp. 1–8, May 2023, doi: 10.1007/s12567-023-00506-y.

Sara El-Gebali
Metadata Specialist at DataCite
Nick Juty
The University of Manchester
Rorie Edmunds
Samples Community Manager at DataCite
Gabriela Mejias
Community Manager at DataCite
Kelly Stathis
Technical Community Manager at DataCite