Write an Open Data Paper!
Gabriel Bodard and Barbara McGillivray
Digital Humanities scholars have long recognised that digital research data is both most useful, and most likely to be disseminated and therefore sustainable in the long term, if it is freely available and openly licensed for creative and transformative reuse [e.g. Cayless 2010]. To this we would add that the potential for reuse and preservation is much higher if people know about your data.
Many high-profile datasets coming out of Digital Classics projects are licensed for reuse precisely because their value lies at least as much (if not more) in the potential for others to exploit and build on them, as in their status as a fixed output of a single research process. Just to give a few illustrative examples:
- The Diorisis Ancient Greek Corpus is a digital collection of ancient Greek texts compiled for linguistic analysis, with the purpose of developing a computational model of semantic change in Ancient Greek [McGillivray/Vatri 2018]. This corpus (itself built on several open data resources) will enable others to address a variety of research questions about the Ancient Greek language, for example on the evolution of Ancient Greek terms in specific areas such as religion, and is already being used by Ancient Greek linguistics scholars.
- Vanessa Gorman has morpho-syntactically annotated half a million words of Ancient Greek literature, and made the resulting treebanks freely available through the Perseus Ancient Greek Dependency Treebank (AGDT) and her own GitHub repository [Gorman 2019]. These trees, alone or alongside the rest of the Perseus AGDT corpus, may be queried or processed for linguistic and stylistic information, can help answer questions about Greek morphology and syntax or authorship attribution, and can also be used in pedagogical contexts.
- In 2017, the Epigraphic Database Heidelberg (EDH), a project of more than 30 years' standing to publish in digital form the inscriptions of the Roman Empire, and at the time in danger of losing its funding, released all of its contents as open-licensed, open data in standard formats (EpiDoc XML; GeoJSON; CSV; RDF, including Lawd, Pelagios, Cidoc, Snap) [EDH Open Data Repository]. This was conceived by the EDH’s Frank Grieshaber to protect against the loss of data if the database were taken offline, but it was also picked up by many scholars in digital classics as a call to arms: the Open Epigraphic Data Unconference, held in London and (remotely) worldwide, kicked off at least half a dozen mini-projects reusing and building on EDH data, and in turn made a compelling argument for the importance of the project (which subsequently had its funding renewed for three years, together with a commitment to keep the database online and stable thereafter).
As can be seen in these examples, publication of open, transparent and licensed data can have positive impacts on reach, dissemination, sustainability, research value, standards development, student engagement, and development of new projects. Most of these projects and datasets are neither the start- nor end-point of the data reuse process; they are both enabled by existing open-licensed resources, and in turn pay it forward by enabling future work, whether it involves the original authors in any capacity or not.
Beyond making data available, licensing it appropriately to liberate it for free reuse, and attaching robust metadata, there is the important question of documenting and disseminating the processes behind the creation of the data itself. One way to publish this invaluable information and further increase the visibility of your work is to write a data paper: a publication that describes a dataset or resource which is openly available in a repository and has potential for reuse.
The Journal of Open Humanities Data (JOHD) is a growing open-access, peer-reviewed academic journal specifically dedicated to data papers for Humanities research. JOHD publishes two types of papers: short data papers, which are 1000-word descriptions of a dataset or resource, and full-length research papers, which are articles of between 3000 and 5000 words discussing methods and challenges in the creation or analysis of datasets in Humanities research. The journal's web page has more information on submission guidelines and publication fees. JOHD has recently published its first article dedicated to a Digital Classics resource, Dependency Treebanks of Ancient Greek prose by Vanessa Gorman, which received 70 views and 8 downloads in just a week. Given the potential of data papers to shape new ways of doing research in Digital Classics, publishing them is a strategically important focus area for JOHD.
The reuse of open data has a positive impact on the authors of the original research, giving further recognition of their work, both through citation (required by almost all open data licenses) and measurable metrics of impact. As well as increasing the visibility of your dataset, a data article is a publication that can be consulted and of course cited by those building on and transforming your digital resources. In this world where more and more academic work is in digital form or computationally enabled, the transparency of reuse and data documentation is an essential part of the scientific process of research, argumentation, publication and citation [Bodard/Garcés 2008].
We invite you to join the open data community by submitting a paper to JOHD describing any open datasets that you have created and made available in public repositories, and that you think can be of interest and value to others.