This week I start as the new DataCite Technical Director. While I get up to speed with existing DataCite services and infrastructure, and we start to launch new services (e.g. this blog), this is also a good time to communicate the overall approach I am taking. I like to call it Data-Driven Development, or DDD as we all love acronyms.
Data-Driven Development and related terms are in use in several contexts, in particular economics, and programming. The term sounds similar to test-driven development and behavior-driven development, two related software development processes. Business intelligence and data science are of course closely related. My definition is as follows:
We develop and maintain our services based on data.
This shouldn’t come as a surprise as DataCite’s mission is Helping you to find, access and reuse data. And my last job at the Open Access publisher PLOS was all about collecting and presenting data about the reuse of scholarly articles (citations, downloads, social media mentions, etc.). But here I mean data in a much broader sense.
While the overall strategic direction is determined by the Board together with the DataCite working groups and members, we can collect data that help with decisions in product development, for example
- service monitoring (see below): how are our services used over time, are there any components that are particularly popular, etc.
- user feedback: ideas, feedback, A/B testing
- bug reports
- discussion boards and direct group messages: related to the last two points, but more allowing a more open discussion
- community events
Compared with the next two sections, tools for data-driven product development are less commonplace (unless I missed them, in which case please provide feedback).
The data generated during software development are increasingly made available through automated tools. We can
- get detailed information out of the version control system
- check for passing and failing tests in continuous integration servers
- check test coverage and overall code quality
- check for consistent coding style
Any web-based service can and should be monitored for
- crashes and other serious errors
- server load, server outages and internal server problems
- server traffic, including traffic to particular pages, percentage of mobile and non-English users, etc.
- specific monitoring for the services you are offering, e.g. in the case of DataCite number of DOIs registered (broken down by data center), number of DOIs with specific metadata (e.g. ORCID identifiers for creators and funding information), and number of DOI resolutions (tricky because there is no easy way to filter out bots)
- user-generated feedback (see section product development)
We don’t want to stop at collecting all these data, we also need a strategy for providing them to the DataCite Board, DataCite working groups, DataCite members and data centers, DataCite staff, and everyone else who cares about these data. The default should be open, exceptions are mostly data that would raise privacy or security concerns, e.g. IP addresses in usage stats. Most of the services mentioned in this post are open for everyone to look at.
Good data-driven development should not only collect lots of data and make them available, but we also need to aggregate the information in meaningful ways. Service monitoring is a good example where staff needs to understand exactly what is going on, but the typical DataCite user only cares about whether all services are running as expected. A status dashboard would be a good solution here.
The data we are generating also need to be put into the broader context. We need
- the DataCite Board to use them for strategic planning
- to provide these data to the DataCite working groups to feed into their work (e.g. stats on what metadata are submitted by data centers for the Metadata Working Group
- the DataCite staff to integrate them in their work (e.g. the Communications Director utilizing the website usage stats)
- these data to adapt the software development roadmap and service infrastructure
Of course I am aware that this is an ambitious agenda, in particular since DataCite is a small non-profit that has limited staff and financial resources. But I don’t think that data-drive development should be left to for-profit organizations and/or to organizations of a certain size. There are several things DataCite can do:
- implement DDD practices over time, starting with one service and one aspect
- use service providers wherever it makes sense (there is a future where you yourself are running less servers). This means anything that is not core to the DataCite mission and where the service provider is better and/or cheaper than what you could do internally. This evaluation can of course change over time
- collaborate with other scholarly non-profits on infrastructure, including DataCite members and data centers, and other persistent identifier providers such as CrossRef and ORCID