Million Books Workshop (brief report)

Imperial College London.
Friday, March 14, 2008.

David Smith gave the first paper of the morning on “From Text to Information: Machine Translation”. The paper included a survey of machine translation techniques (including the automatic discovery of existing translations by language comparison) and some discussion of the value of cross-language searching.

[Please would somebody who did not miss the beginning of the session provide a more complete summary of Smith’s paper?]

Thomas Breuel then spoke on “From Image to Text: OCR and Mass Digitisation” (this would have been the first paper of the day, kicking off the developing thread from image to text to information to meaning, but transport problems caused the sequence of presentations to be altered). Breuel discussed the status of professional OCR packages, which are usually not very trainable and have their accuracy constrained by speed requirements, and explained how the Google-sponsored but Open Source OCRopus package intends to improve on this situation. OCRopus is highly extensible and trainable, but for now it is geared to the needs of the Google Print project (and so while effective at processing book pages, may be less so for more generic documents). The tool is in alpha-release and incorporates the Tesseract OCR engine; it has a lower error-rate than other Open Source OCR tools (but not than the professional tools, which often contain ad hoc code to deal with special cases). A beta release is set for April 2008, which will demo English, German, and Russian language versions, and release 1.0 is scheduled for Fall 2008. Breuel also briefly discussed the hOCR microformat for describing page layouts in a combination of HTML and CSS3.
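To make the hOCR idea a little more concrete, here is a minimal sketch in Python of reading word positions out of an hOCR-style page. It assumes the conventions of the published hOCR format as I understand them: layout elements are ordinary HTML elements with classes such as `ocr_page`, `ocr_line` and `ocrx_word`, and each carries a `title` attribute with a `bbox x0 y0 x1 y1` property. The sample markup and the names `parse_bbox`, `HocrWordParser` and `extract_words` are invented for illustration and are not taken from Breuel's talk or from OCRopus itself.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR-style title string, or None."""
    # Properties in the title are separated by semicolons,
    # e.g. "bbox 200 100 400 140; x_wconf 93".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from elements with class 'ocrx_word'.

    Simplification: assumes word elements are <span>s with no nested spans.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attributes.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


# Invented sample page: one line containing two words.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 1200'>
  <span class='ocr_line' title='bbox 50 100 400 140'>
    <span class='ocrx_word' title='bbox 50 100 180 140'>Million</span>
    <span class='ocrx_word' title='bbox 200 100 400 140; x_wconf 93'>Books</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Because the layout information travels in ordinary HTML attributes, a stock HTML parser is enough to recover it, which is presumably part of the appeal of a microformat over a bespoke file format.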

David Bamman gave the second in the “From Text to Information” sequence of papers, in which he discussed building a dynamic lexicon using automated syntax recognition, identifying the grammatical contexts of words in a digital text. With a training set of some thousands of words of Greek and Latin tree-banked by hand, auto-syntactic parsing currently achieves an accuracy rate somewhat above 50%. This error rate is still too high for the automated process to be useful as an end in itself (to deliver syntactic tagging to language students, for example), but it is good for testing against a human-edited lexicon, which provides a degree of control. Usage statistics and comparisons of related words and meanings give a good indication of the likely sense of a word or form in a given context.

David Mimno completed the thread with a presentation on “From Information to Meaning: Machine Learning and Classification Techniques”. He discussed automated classification based on typical and statistical features (usually binary indicators: is this email spam or not? Is this play a tragedy or a comedy?). Sequences of objects allow for a different kind of processing (for example spell-checking), including named entity recognition. Names need to be identified not only by their form but by their context, and machines do a surprisingly good job at identifying coreference and thus disambiguating between homonyms. A more flexible form of automatic classification is provided by topic modelling, which allows mixed classifications and does not require the definition of labels. Topic modelling is the automatic grouping of topics, keywords, components, and relationships by the frequency of clusters of words and references. This modelling mechanism is an effective means for organising a library collection by automated topic clusters, for example, rather than by a one-dimensional and rather arbitrary classmark system. Generating multiple connections between publications might be a more effective and more useful way to organise a citation index for Classical Studies than the outdated project that is l’Année Philologique.
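As a rough illustration of the binary classification described above (spam or not spam), here is a small Python sketch of a word-count classifier. Naive Bayes with add-one smoothing is chosen here purely for illustration; the report does not record which classifiers Mimno actually covered, and the training messages and the names `train` and `classify` are invented.

```python
import math
from collections import Counter


def train(docs):
    """Build a model from a list of (text, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the prior: how common is this label overall?
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing so unseen (word, label) pairs are not zero.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data.
TRAINING = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(classify(model, "cheap money"))
    print(classify(model, "meeting tomorrow"))
```

A classifier like this needs its labels defined in advance and assigns exactly one to each document, which is the limitation that topic modelling, with its mixed and unlabelled classifications, is meant to relax.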

Simon Overell gave a short presentation on his doctoral research into the distribution of location references within different language versions of Wikipedia. Taking the tagged location links as disambiguators, and the language cross-reference tags as a means of comparing across the collections, he uses the compiled statistics to analyse bias (in a supposedly Neutral Point-Of-View publication) and to provide support for placename disambiguation. Overell’s work is in progress, and he is actively seeking collaborators who might have projects that could use his data.

In the afternoon there were two round-table discussions on the subjects of “Collections” and “Systems and Infrastructure” that I may report on later if my notes turn out to be usable.
