The Combined Classics Library and Institute of Classical Studies, University of London, held a one-day workshop on Thursday July 11, 2019, related to the Cataloguing Open Access Classics Serials (COACS) project, which is now moving its focus from journals to monographs. We invited colleagues working in libraries, archives, bibliography and the digital humanities. We were particularly keen to discuss the process of adding catalogue records for Open Access publications to specialist library catalogues, including:
- sources of open access publication data;
- pathways for ingesting data into our (and shared) catalogue formats;
- workflows for adding necessary metadata such as subject headings to records;
- issues around update and maintenance of records relating to OA publications;
- identification of other consumers of this catalogue data and mechanisms for sharing the outcomes of this work.
The COACS project was funded in 2017 by a strategic grant from the School of Advanced Study, which enabled us to hire a developer, Simona Stoyanova, to convert data from listings and web pages to MARC catalogue records, both to ingest into the ICS/HARL catalogue, and to make available for other institutions to re-use as desired. We received further support from the Classical Association in 2019 to consolidate this cataloguing work and begin planning the work on recording open monographs. In this time we have consulted with publishers, librarians and other heritage professionals on data standards and other technical needs.
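The conversion step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual scripts (which are available in the COACS Git repository); the input dictionary shape, journal name and ISSN are hypothetical, and the output uses MARC 21 mnemonic-style text lines rather than binary MARC.

```python
# Minimal sketch: turn a scraped journal listing into MARC 21-style
# text lines. Field tags follow MARC 21 bibliographic conventions:
# 022 = ISSN, 245 = title, 856 = electronic location (URL).
# The input dict shape here is a hypothetical example, not COACS's format.

def journal_to_marc_lines(journal: dict) -> list[str]:
    lines = []
    if journal.get("issn"):
        # '\\' marks blank indicators in mnemonic notation
        lines.append(f"=022  \\\\$a{journal['issn']}")
    # Indicators '00': title not traced, no nonfiling characters
    lines.append(f"=245  00$a{journal['title']}")
    if journal.get("url"):
        # 856 40: HTTP access to the resource itself
        lines.append(f"=856  40$u{journal['url']}")
    return lines

# Hypothetical journal entry for illustration only
record = journal_to_marc_lines({
    "title": "Open Classics Quarterly",
    "issn": "1234-5678",
    "url": "https://example.org/ocq",
})
for line in record:
    print(line)
```

A real pipeline would of course need to handle many more fields (publisher, language, licence) and to validate the source data, but the overall shape of listing-to-record conversion is as above.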
The July workshop aimed both to address a series of questions that remain to be resolved, and to engage the community of our potential collaborators and users of our shared data. The discussion focussed on five main areas:
- Introductions to the participants and the institutions represented (several libraries, including ICS/HARL, Warburg, Institute of Historical Research, Senate House Library, British Library, Bodleian, Sackler, Research Libraries UK; publishers, including Open Library of the Humanities and University of London Publications; and JISC). We solicited recommendations of (a) sources of data on open access (especially classical) journals, monographs and other titles, (b) data formats and standards that might be useful to investigate for both ingest and export, and (c) other aggregators and discovery tools for this kind of data. The major incompatibility between schemata used by publishers (such as ONIX) and libraries (primarily MARC and RDA) was noted as a serious hurdle to overcome.
- We then discussed pathways to ingest and transform records from the source datasets (whether aggregators, other catalogues, or scraped from individual websites). Questions that arose included to what degree ingests can be automated or need human attention, and what levels of deduplication are needed—a print journal and an online journal may be two separate cataloguable entities, but there will potentially still be items in any ingest dataset that are already in the catalogue, perhaps with different metadata. Guidelines for data formats and especially vocabularies were felt to be very desirable. There was still no consensus on how to record analytic level records (articles/stories/songs etc. within a longer work) in either MARC or ONIX.
- After lunch the discussion turned to processes for enhancing records and metadata added to the library catalogue, in particular which fields are likely to need to be added in the library. Many of these may be specific or idiosyncratic to the individual catalogue, including: classmark; subject heading; keyword; searchable text (abstract); and ORCID identifiers. One difficulty still in need of a solution was how to cross-reference between two related fields in MARC, to make it clear, for example, that a print and an electronic item are “the same” book. We discussed whether ingesting large amounts of imperfect or incomplete data was preferable to never being satisfied with the quality of metadata, and also pondered whether machine learning might (imperfectly) help to produce metadata such as subject headings for sparse records where large quantities of free text are available.
- In the discussion of data curation and persistence, the biggest question was what to do about links to open access items that have disappeared from the web—could broken links be detected and flagged by software, or is human intervention always needed? We did wonder whether encouraging all publishers, authors etc. to submit works to a stable web archive, and including links (or allowing software to create fallback links) to archived versions of a resource, might be a partial solution; DOIs might also serve part of this solution, but there are still fragile points in the chain. The other issue with persistence is when data have been updated or extended: new issues of a journal appear, or a second or expanded edition of a book or non-print-like publication is released. In these cases, do we need to test and deduplicate records again? Do we update or replace records—and if the latter, what happens to the data that have been added or enriched by the librarian?
- The final section of discussion considered who the potential consumers of the data on open access resources produced by library catalogues might be, and in particular what digital humanities research might be enabled by making these records available as freely and in as many formats as possible. We were given a quick overview of the workflow of JISC’s National Bibliographic Knowledgebase (NBK), slides from which are reproduced in the online minutes. We considered what value has been added by a library project such as COACS; the key features identified were license information, open formats, and the scale of data all in one place. We then discussed briefly the value of citation indexes and other novel research in this area, and were given a demo of the Cited Loci of the Aeneid tool (sadly Matteo Romanello was not able to be with us in the afternoon, so he was not present to discuss his own work).
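The deduplication question raised in the discussion of ingest pathways (whether an incoming record already exists in the catalogue, perhaps with different metadata) can be approached by matching on identifiers first and falling back to normalised titles. A minimal sketch follows; the record dicts, field names and journal titles are hypothetical illustrations, not the COACS data model.

```python
import re

def normalise_title(title: str) -> str:
    """Lowercase, strip punctuation and a leading article, for fuzzy matching."""
    t = re.sub(r"[^\w\s]", "", title.lower())
    t = re.sub(r"^(the|a|an)\s+", "", t)
    return " ".join(t.split())

def is_duplicate(incoming: dict, existing: dict) -> bool:
    # An ISSN comparison is decisive when both records carry one; note
    # that a print and an online journal may share a title but carry
    # different ISSNs, so they are correctly kept as separate records.
    if incoming.get("issn") and existing.get("issn"):
        return incoming["issn"] == existing["issn"]
    # Otherwise fall back to comparing normalised titles.
    return normalise_title(incoming["title"]) == normalise_title(existing["title"])

# Hypothetical print and online versions of "the same" journal:
print_record = {"title": "Hypothetical Classics Review", "issn": "1111-2222"}
online_record = {"title": "Hypothetical Classics Review", "issn": "3333-4444"}
print(is_duplicate(print_record, online_record))  # → False: separate entities
```

A production workflow would use more signals (publisher, dates, URLs) and a fuzzier title comparison, but the two-tier identifier-then-title logic is the common pattern.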
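The idea of (imperfectly) deriving subject headings from free text, pondered in the discussion of record enhancement, might in its crudest form look like the keyword lookup below. This is only an illustration of the shape of such a workflow, not machine learning proper; the heading strings and keyword sets are invented examples, and a real system would use a trained classifier over controlled vocabularies.

```python
# Crude sketch: suggest candidate subject headings for a sparse record
# from its free text (e.g. an abstract). The headings and keyword sets
# below are hypothetical examples; a librarian would review suggestions.

HEADING_KEYWORDS = {
    "Epigraphy, Greek": {"inscription", "inscriptions", "epigraphic"},
    "Rome -- History": {"rome", "roman", "republic", "empire"},
    "Greek literature": {"homer", "tragedy", "greek", "poetry"},
}

def suggest_headings(abstract: str, min_hits: int = 2) -> list[str]:
    """Return headings whose keyword sets overlap the text at least min_hits times."""
    words = set(abstract.lower().split())
    suggestions = []
    for heading, keywords in HEADING_KEYWORDS.items():
        if len(words & keywords) >= min_hits:
            suggestions.append(heading)
    return suggestions

print(suggest_headings("Coins and inscriptions of the Roman empire"))
# → ['Rome -- History']
```

The `min_hits` threshold trades recall for precision, mirroring the workshop's wider question of whether imperfect automated metadata is better than none.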
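On the curation question of whether broken links could be detected by software: a simple periodic checker over the catalogue's 856 URLs is certainly feasible, as in the sketch below using only the Python standard library. The status classification is an assumption about a sensible triage policy, not an established COACS workflow; anything not clearly resolvable by software would still be flagged for human attention.

```python
import urllib.error
import urllib.request

# Sketch of automated broken-link triage for OA catalogue records.
# A HEAD request keeps traffic light; status codes are classified so
# that 'gone' links can be flagged for a librarian, or swapped for a
# fallback link to an archived copy of the resource.

def classify_status(code: int) -> str:
    if 200 <= code < 300:
        return "ok"
    if code in (301, 302, 307, 308):
        # Rarely seen via urlopen, which follows redirects by default,
        # but relevant if redirect-following is disabled to detect moves.
        return "redirect"
    if code in (404, 410):
        return "gone"  # candidate for a web-archive fallback link
    return "check"     # other codes need human attention

def check_url(url: str, timeout: float = 10.0) -> str:
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as e:
        return classify_status(e.code)
    except (urllib.error.URLError, TimeoutError):
        return "unreachable"  # DNS failure, refused connection, timeout
```

Even this does not solve the underlying persistence problem—a link that still returns 200 may point at changed content—which is why archived copies and DOIs were discussed as complementary measures.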
As is clear from the above, there were more questions than answers provided by this meeting, but it was a very valuable conversation. The minutes, which we wrote collaboratively at the event, are online at tinyurl.com/COACS2019, with much more detail than this summary, including lists of tools, projects and people, and links to many of the resources discussed. We hope that many of the participants will continue to stay in touch and consider how we can work together to these ends in the future.
The next steps for the COACS project are (i) to clean up and finally ingest the journals data into the ICS Library catalogue; (ii) to decide how to expose the c. 50,000 article-level records in the catalogue and make them available; and (iii) to begin to collect data on open access monographs and assess how feasible it is to create (and deduplicate) catalogue records for these as well. In the meantime, all of the MARC records and the Python scripts we used to create them are available from the COACS Git repository, licensed for you to re-use if they are useful to you.