Participants: Sarah Middle, Angie Lumezeanu, Simona Stoyanova
Report: https://github.com/EpiDoc/OEDUc/wiki/Images-and-image-metadata
The Images and Image Metadata working group met at the London meeting of the Open Epigraphic Data Unconference on Friday, May 15, 2017, and discussed the issues of copyright, metadata formats, image extraction and licence transparency in the Epigraphik Fotothek Heidelberg, the database which contains images and metadata relating to nearly forty thousand Roman inscriptions from collections around the world. Were the EDH to lose its funding and the website its support, one of the biggest and most useful digital epigraphy projects will start disintegrating. While its data is available for download, its usability will be greatly compromised. Thus, this working group focused on issues pertaining to the EDH image collection. The materials we worked with are the JPG images as seen on the website, and the images metadata files which are available as XML and JSON data dumps on the EDH data download page.
The EDH Photographic Database index page states: “The digital image material of the Photographic Database is with a few exceptions directly accessible. Hitherto it had been the policy that pictures with unclear utilization rights were presented only as thumbnail images. In 2012 as a result of ever increasing requests from the scientific community and with the support of the Heidelberg Academy of the Sciences this policy has been changed. The approval of the institutions which house the monuments and their inscriptions is assumed for the non commercial use for research purposes (otherwise permission should be sought). Rights beyond those just mentioned may not be assumed and require special permission of the photographer and the museum.”
During a discussion with Frank Grieshaber we found out that the information in this paragraph is only available on this webpage, with no individual licence details in the metadata records of the images, either in the XML or the JSON data dumps. It would be useful to be included in the records, though it is not clear how to accomplish this efficiently for each photograph, since all photographers need to be contacted first. Currently, the rights information in the XML records says “Rights Reserved – Free Access on Epigraphischen Fotothek Heidelberg”, which presumably points to the “research purposes” part of the statement on the EDH website.
All other components of EDH – inscriptions, bibliography, geography and people RDF – have been released under Creative Commons Attribution-ShareAlike 3.0 Unported license, which allows for their reuse and repurposing, thus ensuring their sustainability. The images, however, will be the first thing to disappear once the project ends. With unclear licensing and the impossibility of contacting every single photographer, some of whom are not alive anymore and others who might not wish to waive their rights, data reuse becomes particularly problematic.
One possible way of figuring out the copyright of individual images is to check the reciprocal links to the photographic archive of the partner institutions who provided the images, and then read through their own licence information. However, these links are only visible from the HTML and not present in the XML records.
Given that the image metadata in the XML files is relatively detailed and already in place, we decided to focus on the task of image extraction for research purposes, which is covered by the general licensing of the EDH image databank. We prepared a Python script for batch download of the entire image databank, available on the OEDUc GitHub repo. Each image has a unique identifier which is the same as its filename and the final string of its URL. This means that when an inscription has more than one photograph, each one has its individual record and URI, which allows for complete coverage and efficient harvesting. The images are numbered sequentially, and in the case of a missing image, the process skips that entry and continues on to the next one. Since the databank includes some 37,530 plus images, the script pauses for 30 seconds after every 200 files to avoid a timeout. We don’t have access to the high resolution TIFF images, so this script downloads the JPGs from the HTML records.
The EDH images included in the EAGLE MediaWiki are all under an open licence and link back to the EDH databank. A task for the future will be to compare the two lists to get a sense of the EAGLE coverage of EDH images and feed back their licensing information to the EDH image records. One issue is the lack of file-naming conventions in EAGLE, where some photographs carry a publication citation (CIL_III_14216,_8.JPG, AE_1957,_266_1.JPG), a random name (DR_11.jpg) and even a descriptive filename which may contain an EDH reference (Roman_Inscription_in_Aleppo,_Museum,_Syria_(EDH_-_F009848).jpeg). Matching these to the EDH databank will have to be done by cross-referencing the publication citations either in the filename or in the image record.
A further future task could be to embed the image metadata into the image itself. The EAGLE MediaWiki images already have the Exif data (added automatically by the camera) but it might be useful to add descriptive and copyright information internally following the IPTC data set standard (e.g. title, subject, photographer, rights etc). This will help bring the inscription file, image record and image itself back together, in the event of data scattering after the end of the project. Currently linkage exist between the inscription files and image records. Embedding at least the HD number of the inscription directly into the image metadata will allow us to gradually bring the resources back together, following changes in copyright and licensing.
Out of the three tasks we set out to discuss, one turned out to be impractical and unfeasible, one we accomplished and published the code, one remains to be worked on in the future. Ascertaining the copyright status of all images is physically impossible, so all future experiments will be done on the EDH images in EAGLE MediaWiki. The script for extracting JPGs from the HTML is available on the OEDUc GitHub repo. We have drafted a plan for embedding metadata into the images, following the IPTC standard.