On 15 May I attended the Open Epigraphic Data unconference, organised by Gabriel Bodard from the Institute of Classical Studies (University of London). I wasn't quite sure what to expect: although I'd attended an unconference before, it had been based more around talks and discussions than around producing practical, code-based solutions to actual problems with a specific resource. The main aim of the day was to increase the use and interoperability of the Epigraphic Database Heidelberg (EDH) dataset, an online collection containing the texts of Latin and bilingual (Latin and Greek) inscriptions from the Roman Empire, marked up using EpiDoc.
Gabriel produced a Google Doc in advance of the event to add ideas about different tasks that we could work on over the course of the day. These included mapping the data to equivalent fields in external resources, ‘mashing-up’ the data with external resources, recognising and disambiguating entities mentioned in inscriptions, and extracting and manipulating specific metadata fields.
In the morning, we divided into three groups – one had a discussion on epigraphic ontologies, another looked at disambiguating personal names, and the third looked at extracting images and related metadata. I was in the images group, and in the time available we ended up creating a script that would iterate through all the image metadata files, extract their filenames, locate images on the EDH website with matching filenames, and download these images. This was a good opportunity for me to practise my Python, which has been getting a bit rusty since I made the last updates to my AHRC data script, and it was very pleasing when we managed to get it to work.
Particular issues we needed to solve included the fact that some metadata files refer to images that are not on the EDH website (we built in a couple of lines of code to skip any files where the image URL did not exist), and that the script occasionally timed out (so we told it to 'sleep' for 30 seconds after every 200 file downloads). It was great to present our results back to the group, and we really felt we'd achieved something in quite a short amount of time.
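The approach we took can be sketched roughly as follows. This is not the script we wrote on the day (that is in the unconference's GitHub repository); the base URL, the metadata element name, and the directory layout here are all assumptions for illustration, and the real EDH files may differ.

```python
import os
import time
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical base URL for EDH photos -- check the real site for the
# actual path structure.
BASE_URL = "https://edh-www.adw.uni-heidelberg.de/photos/"


def image_filenames(metadata_dir):
    """Yield the image filename recorded in each metadata XML file.

    Assumes each file contains a <filename> element; the real EDH
    metadata schema may use a different element name.
    """
    for name in sorted(os.listdir(metadata_dir)):
        if not name.endswith(".xml"):
            continue
        tree = ET.parse(os.path.join(metadata_dir, name))
        elem = tree.find(".//filename")
        if elem is not None and elem.text:
            yield elem.text.strip()


def download_all(metadata_dir, out_dir, pause_every=200, pause_secs=30):
    """Download each image, skipping missing URLs and pausing periodically."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for fname in image_filenames(metadata_dir):
        try:
            urllib.request.urlretrieve(BASE_URL + fname,
                                       os.path.join(out_dir, fname))
        except urllib.error.URLError:
            continue  # metadata file refers to an image not on the site
        count += 1
        if count % pause_every == 0:
            time.sleep(pause_secs)  # be gentle; avoids occasional timeouts
```

The two workarounds mentioned above appear as the `except` clause (skip files whose image URL does not exist) and the periodic `time.sleep` call.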
We didn’t get round to the main aims of the exercise, which were to identify rights information in each file, and to work out which collection owns each image, but we were keen to carry on the work after the unconference so watch this space! The code we produced is available via the unconference’s GitHub repository and more information about this session can be found in the relevant post on the Stoa Consortium blog.
After lunch, three new topics were agreed upon. One related to URI matching between EDH and Pelagios, another was about linking names from EDH to entities in SNAP:DRGN, and the third was about implementing Named Entity Recognition (NER) on inscription texts. I was in the NER group, although we unfortunately didn't actually get round to doing any NER, as we spent all the allotted time working out how to extract the inscription text from an XML file and then strip out the XML tags to leave the inscription contents as plain text. Googling suggested various methods of varying complexity, none of which worked particularly well – the main issue we found was in extracting the contents of an XML element, rather than the element itself.
Finally, I found the Python library Beautiful Soup, which parses an XML document into a navigable tree, from which you can locate your desired element and extract its contents as plain text. It was a very simple and elegant solution, needing only eight lines of code to extract and convert the inscription text from one specific file. The next step is to create a script that will automatically do this for all files in a particular folder, producing a directory of new files that contain only the plain text of the inscriptions. Once we have this, we can then run NER scripts to find any entities that may be located within them. It was a bit disheartening to have been stumped for most of the available time by what turned out to be a fairly simple problem, but at least we managed to produce something from it, and it feels like there is a way forward if we decide to pursue this work in future. Again, there is a Stoa Consortium blog post with more details about this session.
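For anyone curious, a Beautiful Soup extraction along these lines might look like the sketch below. This is not the eight lines we actually wrote; it assumes the inscription sits in an EpiDoc-style `<div type="edition">` element, which may not match the structure of every EDH file.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def inscription_text(xml_string):
    """Return the inscription contents of an EpiDoc-style XML string
    as plain text, with all markup stripped.

    Assumes the edited text lives in <div type="edition">, the usual
    EpiDoc convention; falls back to the whole document otherwise.
    """
    soup = BeautifulSoup(xml_string, "html.parser")
    edition = soup.find("div", type="edition")
    target = edition if edition is not None else soup
    # get_text() discards every tag, leaving only the text nodes;
    # split/join collapses the leftover whitespace.
    return " ".join(target.get_text().split())
```

The key call is `get_text()`, which solves exactly the problem we had been stuck on: it returns an element's text content with the tags themselves removed.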
The unconference turned out to be a really interesting day, and it was so encouraging that I was able to make far more of a contribution code-wise than I had been expecting. It seems like some fascinating projects have been started, and it will be exciting to see how all of this work develops in future – it definitely seems like unfinished business at the moment so hopefully I and the rest of my two groups will have time to complete our original objectives. I would like to thank Gabriel Bodard and the Institute of Classical Studies for hosting the event and for providing support towards travel costs.