DataCite – GigaScience internship: how to improve metadata completeness

https://doi.org/10.5438/jvmx-7e44

Hello! I'm Kelvin, and I interned with DataCite remotely from Hong Kong this summer. After this internship, I will return for the final year of my Bachelor of Computer Engineering at the Hong Kong University of Science and Technology, where I specialize in Data Science. This was my first time working with metadata, and it turned out to be a lot more interesting than I expected, in the best way possible. I really enjoyed working here.

I worked from the GigaScience office in Shek Mun, Hong Kong. My main responsibility was to analyse the metadata completeness of GigaScience’s datasets. Although we were neither in the same office nor in the same time zone, Kristian Garza from the Product Engineering team and I worked closely together on the tasks. He helped me a lot, including, but not limited to, getting me familiar with DataCite DOI services and guiding me through technical problems. Thanks also to the support from GigaScience, I had sufficient resources to review their datasets and produce some outputs.

In a nutshell, I identified the fields missing from GigaScience’s metadata and the underlying reasons. Along the way, I developed some tools, such as a Python script and some dashboard objects. The outcomes were presented to DataCite and GigaScience.

Here is a brief overview of the outcomes:

Python script to help load data from XML files: https://github.com/kelvinlyy/drax
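To give a flavour of what loading metadata from XML and checking it for gaps can look like, here is a minimal sketch. It is not the drax script itself: the function names, the list of fields, and the two sample records are all made up for illustration, and it assumes each record is a DataCite-style XML document in which a property counts as present when its element contains non-empty text.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample records, invented for this sketch (not real GigaScience data).
SAMPLE_RECORDS = [
    """<resource xmlns="http://datacite.org/schema/kernel-4">
         <identifier identifierType="DOI">10.1234/example-1</identifier>
         <titles><title>Example dataset one</title></titles>
         <publisher>Example Publisher</publisher>
       </resource>""",
    """<resource xmlns="http://datacite.org/schema/kernel-4">
         <identifier identifierType="DOI">10.1234/example-2</identifier>
         <titles><title>   </title></titles>
       </resource>""",
]


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}title' becomes 'title'."""
    return tag.rsplit("}", 1)[-1]


def present_fields(xml_text):
    """Return the names of all elements in one record that hold non-empty text.

    Assumption: an element with only attributes or only whitespace is
    treated as missing.
    """
    root = ET.fromstring(xml_text)
    return {
        local_name(el.tag)
        for el in root.iter()
        if el.text and el.text.strip()
    }


def missing_rates(records, fields):
    """Return, for each field, the fraction of records where it is missing."""
    records = list(records)
    if not records:
        return {}
    missing = Counter()
    for xml_text in records:
        found = present_fields(xml_text)
        for field in fields:
            if field not in found:
                missing[field] += 1
    return {field: missing[field] / len(records) for field in fields}


if __name__ == "__main__":
    print(missing_rates(SAMPLE_RECORDS, ["identifier", "title", "publisher"]))
    # {'identifier': 0.0, 'title': 0.5, 'publisher': 0.5}
```

In the second sample record the title element is there but blank, so it is counted as missing, just like the absent publisher.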

Dashboards developed for analysing metadata completeness:

Comparison of the completeness of GigaScience metadata with the DataCite average, by property:

9 fields where GigaScience and the DataCite average are about the same

18 fields where GigaScience has less missing metadata than the DataCite average

29 fields where GigaScience has more missing metadata than the DataCite average
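A three-way comparison like the one above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the tolerance used to decide what counts as "about the same", and the example rates are all hypothetical, and none of them come from the actual analysis or dashboards.

```python
def compare_missing(repo_rates, baseline_rates, tolerance=0.01):
    """Group fields by how a repository's missing rate compares to a baseline.

    Both arguments map a field name to the fraction of records missing it.
    The tolerance is an assumed threshold for "about the same".
    """
    groups = {"same": [], "less_missing": [], "more_missing": []}
    # Only compare fields that appear in both inputs.
    for field in sorted(repo_rates.keys() & baseline_rates.keys()):
        diff = repo_rates[field] - baseline_rates[field]
        if abs(diff) <= tolerance:
            groups["same"].append(field)
        elif diff < 0:
            groups["less_missing"].append(field)
        else:
            groups["more_missing"].append(field)
    return groups


if __name__ == "__main__":
    # Invented numbers, purely for illustration.
    repo = {"title": 0.0, "rights": 0.10, "fundingReference": 0.90}
    baseline = {"title": 0.0, "rights": 0.40, "fundingReference": 0.85}
    for group, fields in compare_missing(repo, baseline).items():
        print(group, len(fields), fields)
```

Counting the fields in each group gives the kind of summary shown in the list above.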

… And more in the full Intern Activities Report

I am also impressed by the concept of Open Science that DataCite works towards. It enables everyone to collaborate in and contribute to the research community, and I believe it will ultimately benefit humanity. I am happy that DataCite is reaching out to Asia. During this internship, I also did some outreach for DataCite in the Chinese-language community. Hopefully, it will be the starting point for further Open Science activities in Asia, and I look forward to seeing that while I continue my studies.

Lai Yung Yin Kelvin
Intern at DataCite