[Note: this is part of a paper written after a conference on Digital Lexicography at the University of Cambridge in 2002, and was scheduled to appear in the print publication of the proceedings. As the publication never took place, and the paper is now rather too out of date to publish by traditional means without a lot more work, I’m posting here under a Creative Commons Attribution license that part of it (a little more than a third of the length) that might still be of some small interest. No significant changes have been made to this material since 2003 (e.g. code examples use TEI P4).]
Introduction
In this paper I discuss the digital markup of epigraphic texts, using the Aphrodisias in Late Antiquity 2004 electronic publication as an example corpus. I shall consider some of the uses to which the original electronic source code can be put, which include compiling (or contributing to) indices and databases external to the original, limited project. Such external uses might include an onomastic database, a gazetteer of place names, or a digital lexicon, to suggest only three.
[I omit from this version the first two sub-sections of the paper (history of EpiDoc and thoughts on digital publication) which have now been more fully explored in this DHQ chapter and my Digital Medievalist article, respectively]
2. Onomastic and Lexicographic Markup
The aspect of the marked-up inscriptions that may be of most relevance to the topic of this volume is the ability to mark up words and names for indexing, searching, and export to external databases, such as lexica, prosopographies, and gazetteers. It is important to note that the author of the electronic publication need not predict these uses for it to be possible to exploit the resource in this way. I shall give here fairly brief examples of how names are marked up and how information might be extracted from them (using the Lexicon of Greek Personal Names database as an example of the format and structure that such an output might take), followed by a similar example of the markup of two previously unattested words in the Aphrodisias material. (I should note that the LGPN database already contains data on the inscriptions of Asia Minor, so the following description is by way of example only, not how the database will be built in this case.)
2.i Personal Names
Personal names in the ALA2004 corpus are marked up in XML using the tag <persName>. This tag allows the user to specify two important pieces of information: the regularised form of the name, and a database key pointing to an authority list of individuals. The name of Asclepiodotus in inscription 54, used for our example above, is marked up as follows:
<persName key="AsclepiodotusAph" reg="Ἀσκληπιόδοτος" type="Aphrodisian">Ἀσκληπιόδοτος</persName>
The regularised form of the name is in this case identical to the form in the inscription, since there are no variant spellings and the name is in the nominative case. The @type="Aphrodisian" attribute is for internal purposes, to indicate in which of the indices this name belongs, although it could also be used for prosopographical sorting. The authority list, to which the key “AsclepiodotusAph” points, gives the following additional information:
ID: AsclepiodotusAph
Full Name: Asclepiodotus of Aphrodisias
Extra Info: philosopher, and benefactor, father of Damiane: PLRE II, Asclepiodotus 2
Now if a prosopographical database such as the LGPN were to use the electronic files of these inscriptions marked up in EpiDoc XML to extract data on personal names and people, much of the information needed could be automatically generated by a simple XML parser. The LGPN main database has five fields: name, place, date, reference, ‘final brackets’ (for miscellaneous additional information). Four of these fields could be filled in, at least to a preliminary standard, by the parser.
The ‘name’ field contains the personal name in Greek, which is the content of the @reg attribute in the XML, i.e. Ἀσκληπιόδοτος. The ‘place’ field is dependent on contextual information: it could be extracted by a human from the epithet in the authority list, but it could also be automated to the extent that all names extracted from eALA inscriptions with the attribute value @type="Aphrodisian" are from Aphrodisias. The ‘date’ field is more complicated: the LGPN uses a series of encoded values which expand to a range of dates in the database; for example, 4A (= fourth century A.D.) translates to 300-400. Luckily, dates in EpiDoc files are handled similarly, if not using the same system: the date of our example inscription is listed both (in prose) as ‘late fifth century A.D.’, and (behind the scenes) as 467-500. The ‘reference’ field lists the primary reference for the name, in this case our inscription; a parser can easily extract the information: ALA2004 54, 2 (and analysis of the full corpus will reveal that it also occurs in 53).
The content of the ‘final brackets’ field would almost certainly have to be generated by a human editor (just as the first four fields would need to be checked). If the name is not in the nominative in the text, the attested form could be imported from the content of the <persName> element. A secondary reference, such as this person’s entry in the PLRE, could also be extracted from the authority list.
The process of extracting the necessary information for a record in a prosopographical database is therefore not fully automated: editorial intervention and checking will always be required. A parser working with files in an interchange format like EpiDoc can nonetheless speed the work up, creating preliminary records that only need collating and editing, along with most of the reference information to make the editor’s job easier. The electronic file does not do away with the author, but it provides an extra tool to facilitate their work.
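By way of illustration only, the field mapping described above might be sketched in Python as follows. This is a minimal sketch, not the method used by the LGPN or by ALA2004: the <div n="54"> wrapper, the <lb n="2"/> line marker, the function name preliminary_records, the dictionary keys, and the way the date range is passed in by hand are all assumptions of mine; only the <persName> element and the authority list entry are taken from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <div n="54"> wrapper and the <lb n="2"/> line
# marker are assumptions; only the <persName> element comes from the paper.
SAMPLE_54 = (
    '<div n="54"><ab><lb n="2"/>'
    '<persName key="AsclepiodotusAph" reg="Ἀσκληπιόδοτος" type="Aphrodisian">'
    'Ἀσκληπιόδοτος</persName></ab></div>'
)

# The authority list entry quoted above, held here as a plain dictionary.
AUTHORITY = {
    "AsclepiodotusAph": {
        "full_name": "Asclepiodotus of Aphrodisias",
        "extra_info": "philosopher, and benefactor, father of Damiane: "
                      "PLRE II, Asclepiodotus 2",
    }
}


def preliminary_records(xml_text, date_range, authority):
    """Build draft records (name, place, date, reference) for an editor to check."""
    root = ET.fromstring(xml_text)
    inscription = root.get("n")  # assumed: inscription number on the root element
    line = None
    records = []
    for el in root.iter():  # iter() walks the tree in document order
        if el.tag == "lb":
            line = el.get("n")  # remember the most recent line number
        elif el.tag == "persName":
            attested = "".join(el.itertext()).strip()
            reference = f"ALA2004 {inscription}"
            if line is not None:
                reference += f", {line}"
            entry = authority.get(el.get("key"), {})
            records.append({
                "name": el.get("reg") or attested,
                "place": "Aphrodisias" if el.get("type") == "Aphrodisian" else None,
                "date": date_range,      # supplied by hand in this sketch
                "reference": reference,
                "final_brackets": None,  # always left to the human editor
                "attested_form": attested,
                "secondary_reference": entry.get("extra_info"),
            })
    return records


for record in preliminary_records(SAMPLE_54, (467, 500), AUTHORITY):
    print(record["name"], record["place"], record["date"], record["reference"])
```

The ‘final brackets’ value is deliberately left empty, and every other field should be treated as a draft for the editor to check rather than as a finished record.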
2.ii New Lexical Words
The EpiDoc system also allows individual lexical words to be marked up, as is done in the ALA2004 project. The word in the text is enclosed in <w> and </w> tags (standing for ‘word’), and the element carries an attribute @lemma="x". The @lemma attribute is equivalent to the @reg attribute of <persName>: it gives the word in its dictionary form (nominative or first person as appropriate, in normalized spelling, etc.). In ALA2004 the principal purpose of this lemmatizing is the generation of indices: as well as personal and place names, the publication includes an index of Greek words. (Not all words are indexed, of course, but they are all marked up for completeness; the stylesheet can be told not to index words like ὁ, καί, δέ, and so forth.)
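The index generation just described is done in ALA2004 by a stylesheet; purely as an illustration of the same idea, a minimal Python sketch might look like the following. The names STOP_LEMMATA, SAMPLE_TEXTS, and word_index, and the <ab> wrapper around the words, are hypothetical, and the stop list contains only the three words mentioned above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Lemmata excluded from the index (only the three examples named in the text).
STOP_LEMMATA = {"ὁ", "καί", "δέ"}

# Hypothetical input: inscription number mapped to a marked-up fragment.
SAMPLE_TEXTS = {
    80: '<ab><w lemma="ὁ">τῶν</w> <w lemma="σελλοφόρος">σελλοφόρων</w></ab>',
}


def word_index(texts):
    """Map each indexed lemma to a list of (inscription, attested form) pairs."""
    index = defaultdict(list)
    for number, xml_text in texts.items():
        root = ET.fromstring(xml_text)
        for w in root.iter("w"):
            lemma = w.get("lemma")
            if lemma is None or lemma in STOP_LEMMATA:
                continue  # unlemmatized or deliberately unindexed word
            form = "".join(w.itertext()).strip()
            index[lemma].append((number, form))
    return dict(index)


for lemma, entries in word_index(SAMPLE_TEXTS).items():
    print(lemma, entries)
```

Every word remains marked up in the source file; the decision about what to leave out is made only at the point of indexing, so a different project could reuse the same files with a different stop list.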
As an aside, there is software, such as that developed by Perseus, which will lemmatize a Greek text with a greater or lesser degree of automation. This software needs to refer to a digitized dictionary, and so is of limited value when it comes to new words, errors, or misspellings; even in a perfectly clear text a human editor will need to resolve ambiguous forms. In inscriptions especially, the exceptions will be many, but work is ongoing to speed the process of lemmatizing by means of such tools.
Although ALA2004 has not yet marked up any further grammatical information, such as part of speech, syntax or linguistic structure (nor, as far as I know, has any other EpiDoc project), this possibility is not precluded in the future. There are other projects in both Greek (ancient and modern) and Latin that use TEI to encode linguistic features in their texts, from grammar, morphology and syntax to narrative structures and discourse features. Such information could further enrich the value of a digital text for lexicographic use.
Inscriptions are a particularly rich source of previously unattested words, words that might be added to a supplement or new edition of a Greek (or Latin) lexicon. The two hundred and fifty inscriptions in ALA2004 provide several new Greek words, of which I shall here give two as examples. The words in the index whose lemmata do not appear in the standard lexica, but are interpretable in their contexts, are σελλοφόρος and μυδροστασία.
Automatic extraction from the text could provide both the lemma, and the attested form, since the words would occur in the files in the following forms:
<w lemma="ὁ">τῶν</w> <w lemma="σελλοφόρος">σελλοφόρων</w>
<w lemma="ὁ">τῆς</w> <w lemma="μυδροστασία">μυδροστασίας</w>
The parser could also extract references for both words: ALA2004 80, 5 and 208, 1 respectively. This may be all that can be fully automated by a parser without human intervention, and an editor would need to check even this information for relevance and correct formatting for each lexicon entry. Nevertheless, this would be a start, and if the parser also extracted the immediate context of the word in the Greek text, the translation, and the immediate discussion from the electronic file, the editor’s task would be greatly facilitated.
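The extraction described above might be sketched as follows, again as an illustration under stated assumptions rather than a description of any existing tool. The set KNOWN_LEMMATA stands in for a digitized lexicon and contains a single word; the <div n="208"> wrapper, the <lb n="1"/> line marker, and the function names are hypothetical, and the sample text is simplified to the two words shown above.

```python
import unicodedata
import xml.etree.ElementTree as ET


def nfc(text):
    """Normalize Greek text so that visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)


# Stand-in for the lemma list of a standard lexicon (assumption: one entry only).
KNOWN_LEMMATA = {nfc("ὁ")}

# Hypothetical fragment: wrapper and line marker are assumed, and the text is
# simplified to the two marked-up words quoted in the paper.
SAMPLE_208 = (
    '<div n="208"><ab><lb n="1"/>'
    '<w lemma="ὁ">τῆς</w> <w lemma="μυδροστασία">μυδροστασίας</w>'
    '</ab></div>'
)


def unattested_words(xml_text, known_lemmata):
    """List words whose lemma is absent from known_lemmata, with form and reference."""
    root = ET.fromstring(xml_text)
    inscription = root.get("n")
    # Whole text with whitespace collapsed, offered as context for the editor.
    context = " ".join("".join(root.itertext()).split())
    line = None
    found = []
    for el in root.iter():  # document order
        if el.tag == "lb":
            line = el.get("n")
        elif el.tag == "w":
            lemma = nfc(el.get("lemma", ""))
            if lemma and lemma not in known_lemmata:
                found.append({
                    "lemma": lemma,
                    "form": "".join(el.itertext()).strip(),
                    "reference": f"ALA2004 {inscription}, {line}",
                    "context": context,
                })
    return found


for word in unattested_words(SAMPLE_208, KNOWN_LEMMATA):
    print(word["lemma"], word["form"], word["reference"])
```

The Unicode normalization step is there because accented Greek characters can be encoded in more than one way, and a lemma would otherwise be reported as new merely because the file and the lexicon encode an accent differently.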
At present, as mentioned above, there is no part of speech or morphological information in the EpiDoc markup scheme, so an editor would have to specify that both of these words are nouns, and interpret their declensions so as to give the genitive ending, for example, or decide if a verb is athematic, irregular, vel sim. Likewise, the gloss or definition of the word (depending on the nature of the lexicon) would have to be derived from the translation and commentary rather than automatically extracted by even the most intelligent of parsers. But the information provided by our parser would quickly allow the compilation of a preliminary entry for each word along the following lines:
σελλοφόρος, ου, chair-bearer, cf. διφροφόρος, Lat. sellarius; ALA 80, 5.
μυδροστασία, ας, place of the μύδρος, anvil = ?forge; ALA 208, 1, τό(πος) τῆς μυδροστασίας.
Both the extraction of personal names by a prosopographical project, and of lemmatised word-forms by a new lexicon (and similarly place names by an atlas or gazetteer project, which I have not discussed), could be facilitated by a preliminary pass of a parser over epigraphical texts marked up in EpiDoc XML. This would not remove the human editor from the chain: even if the processor could create a complete entry from our format, one would want the result checked by a human at the very least. Nor would this process completely replace traditional research in the creation of slips and compiling new words, or instances of personal and place names; a human editor is still needed to check results and search for references from other sources.
But the electronic edition of epigraphic data (or, in sister projects, papyrological, numismatic, or other textual material) allows for greater accessibility to the information. Not only do more people have access to a web edition than to a printed library book, but the publication of source files in the form of XML and other code allows the data to be queried and manipulated in ways that do not have to be predicted by the original project’s editors.