Million Books Workshop, Friday, March 14, 2008, Imperial College London.
In the afternoon, the first of two round table discussions concerned the uses to which massive text digitisation could be put by the curators of various collections.
The panellists were:
- Dirk Obbink, Oxyrhynchus Papyri project, Oxford
- Peter Robinson, Institute for Textual Scholarship and Electronic Editing, Birmingham
- Michael Popham, Oxford University Library Services
- Charlotte Roueché, EpiDoc and Prosopography of the Byzantine World, King’s College London
- Keith May, English Heritage
The session was chaired by Gregory Crane (Perseus Digital Library), who kicked off by asking the question:
If you had all of the texts relevant to your field—scanned as page images and OCRed, but nothing more—what would you want to do with them?
- Roueché: analyse the texts in order to compile references toward a history of citation (and therefore a history of education) in later Greek and Latin sources.
- Obbink: generate a queryable corpus
- Robinson: compare editions and manuscripts for errors, variants, etc.
- Crane: machine annotation might achieve results not possible with human annotation (especially at this scale), particularly if learning from a human-edited example
- Obbink: identify text from lost manuscripts and witnesses, toward the generation of stemmata. Important question: do we also need to preserve the apparatus criticus?
- May: perform detailed place and time investigations into a site in preparation for any new excavations
- Crane: data mining and topic modelling could lead to the machine-generation of an automatically annotated gazetteer, prosopography, dictionary, etc.
- Popham: metadata on digital texts scanned by Google is not always accurate or complete, and does not meet academic standards; the scanning project is aimed at accessibility, not preservation
- Roueché: Are we talking about purely academic exploitation, or our duty as public servants to make our research accessible to the wider public?
- May: this is where topic analysis can make texts more accessible to the non-specialist audience
- Brian Fuchs (ICL): insurance and price comparison sites, Amazon, etc., have sophisticated algorithms for targeting web materials at particular audiences
- Obbink: we will also therefore need translations of all of these texts if we are reaching out to non-specialists; will machine translation be able to help with this?
- Roueché: and not just translations into English; we need to make these resources available to the whole world.
(Disclaimer: this summary is partial and partisan, reflecting those elements of the discussion that seemed most interesting and relevant to this blogger. The workshop organisers will publish an official report on this event presently.)