Breaking a Metadata Barrier: Improving discoverability with automatic subject classification

https://doi.org/10.5438/vwp0-vz05

In the realm of scholarly publishing, the discipline metadata of outputs is of utmost importance. It is the backbone upon which works are discovered, indexed, and ultimately read. Without proper discipline metadata, outputs risk being lost in the vast sea of information in the scholarly landscape.

However, despite the importance of discipline metadata, it can be a source of frustration for both authors and publishers. The process of accurately identifying and assigning discipline metadata to a scholarly resource is a laborious and often subjective task, one that has traditionally been the responsibility of the publisher.

In a previous blog post (Garza, 2021), we discussed how complete standardized discipline metadata is in the DataCite metadata corpus. In this corpus, discipline metadata is generally stored as subject metadata, so we use that term from here on. We showed that completeness was low (only around 6% of DOI metadata included disciplinary information in a standard manner) and explained that DataCite was taking steps to improve it. Back then, we invited you, our community, to follow standards and gave examples of practices that could help you improve subject metadata completeness in your repositories.

Adoption of these practices and uptake of subject metadata have grown since then, but not at the pace we had hoped for. Organic adoption of subject metadata is limited to certain fields. Today, 15% of metadata deposits include standardized subject metadata, an increase of 9 percentage points from the roughly 6% reported in 2021.


Fig. 1 Distribution of standardized DataCite DOI metadata by Field of Science. From (Garza, 2021)
Fig. 2 Adoption of standardized subject metadata by repositories as of April 2023.

One Solution

This is where the new feature comes in: a system that enriches subject metadata of scholarly outputs using scholarly publishers’ and scholarly repositories’ subject metadata. By leveraging the expertise of both single-subject publishers and repositories, this feature is able to accurately identify and assign subject metadata to a metadata deposit, making the process of discovery and indexing a work faster, more accurate, and less subjective.

The new feature works by using the subject classification of single-discipline repositories and applying it to the metadata deposits in the repository. This is a method DataCite devised in conjunction with bibliometricians from the Make Data Count and the Meaningful Data Counts projects. By matching these subjects to the subject metadata of works held by repositories, the feature can assign relevant and accurate subject metadata to the deposit in question. 
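The idea can be illustrated with a minimal Python sketch. It is not DataCite's implementation: the repository identifiers, DOIs, the `Deposit` class, and the `enriched_subjects` function are all hypothetical, and the rule that a deposit's own subjects take precedence over inherited ones is an assumption made for the example.

```python
from dataclasses import dataclass

# Hypothetical lookup: repository identifier -> Field of Science subjects
# declared for that single-discipline repository. A multidisciplinary
# repository has nothing to pass on, so its list is empty.
REPOSITORY_SUBJECTS = {
    "example.oceanography": ["Earth and related environmental sciences"],
    "example.general": [],
}


@dataclass(frozen=True)
class Deposit:
    """A simplified metadata deposit (illustrative fields only)."""
    doi: str
    repository_id: str
    subjects: tuple = ()


def enriched_subjects(deposit, repository_subjects=REPOSITORY_SUBJECTS):
    """Return the subjects to use for indexing and search.

    Assumption: subjects supplied with the deposit win; otherwise the
    deposit inherits the subjects of its repository. The deposit itself
    is never modified.
    """
    if deposit.subjects:
        return list(deposit.subjects)
    return list(repository_subjects.get(deposit.repository_id, []))


deposits = [
    Deposit("10.1234/example-a", "example.oceanography"),
    Deposit("10.1234/example-b", "example.oceanography", ("Biological sciences",)),
    Deposit("10.1234/example-c", "example.general"),
]

for d in deposits:
    print(d.doi, enriched_subjects(d))
```

In this sketch the first deposit inherits the repository's subject, the second keeps the subject it was deposited with, and the third stays without a subject because its repository is not tied to a single discipline.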

The Benefits for You

In practice, this means that a work that may have previously been difficult to discover and index due to a lack of accurate subject metadata can now be easily found and read by researchers and scholars. It is a win-win for both authors and publishers: authors can be assured that their work will be properly indexed and discovered, while publishers can spend less time and resources on the laborious task of assigning subject metadata. It is also beneficial to DataCite infrastructure such as the PID Graph or DataCite Commons where more accurate subject metadata leads to improved findability of the resource.

The new feature is already being integrated into the workflow of the PID Graph API and DataCite Commons, and the early results are very promising. With this approach, standardized subject metadata coverage has risen to 20% of all metadata deposits. Bibliometricians will be able to carry out more comprehensive studies by field of science, and users of DataCite Commons and the PID Graph will be able to discover new works more easily.

Fig. 3 Time distribution of metadata deposits highlighting deposits with subject metadata (in yellow). Comparison in 2023 includes before (Organic) and post-enrichment (Improved).
Fig. 4 Distribution of enriched standardized DataCite DOI metadata by Field of Science. Today and thanks to this approach, 20% of all metadata deposits have standardized subject metadata.

Managers of disciplinary repositories can use this feature to enrich their query metadata in Fabrica, either by entering their repository's re3data ID in the settings or by adding disciplines directly in the repository form. This does not modify DOI metadata; it is used only to enhance search queries. See the documentation on the support website for details on how to update repository settings as a Direct Member or as a Consortium Lead/Consortium Organization.

In the future, DataCite will explore applying similar approaches to other services, such as the REST API or Fabrica. We invite you to express your interest in such explorations in the DataCite roadmap.

In conclusion, the new feature, which enriches the subject metadata of scholarly outputs using scholarly publishers' and scholarly repositories' subject metadata, is a stride forward in making scholarly works identified with DataCite DOIs more discoverable, more accessible, and more valuable to researchers and scholars everywhere.

References

Garza, K. (2021). Are You There, Metadata? It’s Me, the Bibliometrician. https://doi.org/10.5438/J4XV-Y945