Today I attended the penultimate workshop in the CHASE Arts and Humanities in the Digital Age programme, which was on the topic of databases and led by Sharon Webb from the University of Sussex. Like the other workshops in the series, it partly consolidated my existing knowledge and partly taught me new things – in particular, it differed from previous database training I have attended in that it focused more on the principles behind database design than on the practicalities of creating a database using specific software.
Sharon started by talking about the use of data in the Humanities more generally, taking a chronological approach from the 1940s(!) onwards. This was especially pertinent for me at the moment as I am in the process of writing a literature review on data use in the Humanities, and Sharon signposted some interesting research by Roberto Busa and Manfred Thaller that I will follow up. I particularly liked the way that Sharon referred to the ‘Problem Domain’ of a Digital Humanities project as the Humanities and the ‘Solution Domain’ as Computer Science. This really crystallises the relationship between the two disciplines and highlights that the purpose of applying technology to a Humanities project is to help answer a research question, rather than ‘just because we can’.
We then discussed the process of requirements analysis – the various factors that need to be taken into account before a database can be designed. These include the process model to be used during the implementation phase (for example, the waterfall model vs. the agile model), balancing user and stakeholder needs, and how the data should be modelled. This was particularly interesting to me because, while I am familiar with aspects of the data-based approach (storing different pieces of data in e.g. a database, with no reference to the original document) and the document-based approach (preserving the original document but enhancing it with e.g. TEI to provide structure and semantics), I had not realised that these were formalised concepts and that either or both could have a role to play, depending on the end goals of the project. Most of my work has involved extracting entities and concepts and organising them into some kind of structure, but I can see that in other instances it would be important to retain the document as a whole, as it provides important context and allows analysis of the phrasing surrounding particular concepts, which would be lost if the concepts were simply extracted and standardised.
Finally, we arrived at the subject of databases. Rather than starting from the tabular structure, Sharon took us through the various steps of creating an entity-relationship diagram. This involves identifying all the entities in your data (e.g. people, places, organisations), and visualising their attributes (the pieces of information you have about each entity) and relationships with each other. This includes establishing whether the relationship is ‘one-to-one’, ‘one-to-many’ or ‘many-to-many’ – if the relationship is ‘many-to-many’, you will need to create one table for each entity to contain the attributes you have identified, and then a third ‘linking’ table containing the ID numbers of the particular instances of each entity that you want to link.
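To make the ‘linking’ table idea concrete for myself, here is a minimal sketch in Python using the built-in sqlite3 module. This is my own illustration rather than anything shown in the workshop: the table names (person, organisation, membership), the column names and the sample rows are all hypothetical, and I am assuming a many-to-many relationship in which a person can belong to several organisations and an organisation can have several people.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with two entity tables and one linking table.

    All table names, column names and rows are hypothetical examples.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- One table per entity, holding that entity's attributes.
        CREATE TABLE person (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE organisation (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        -- The linking table holds only pairs of IDs, one row per link.
        CREATE TABLE membership (
            person_id       INTEGER NOT NULL REFERENCES person(id),
            organisation_id INTEGER NOT NULL REFERENCES organisation(id),
            PRIMARY KEY (person_id, organisation_id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO person (id, name) VALUES (?, ?)",
        [(1, "Person A"), (2, "Person B")],
    )
    conn.executemany(
        "INSERT INTO organisation (id, name) VALUES (?, ?)",
        [(1, "Society X"), (2, "Society Y")],
    )
    # Person A belongs to both societies; Person B belongs only to Society X.
    conn.executemany(
        "INSERT INTO membership (person_id, organisation_id) VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 1)],
    )
    conn.commit()
    return conn


def organisations_for(conn, person_name):
    """Return the names of the organisations linked to the named person."""
    rows = conn.execute(
        """
        SELECT o.name
        FROM organisation AS o
        JOIN membership AS m ON m.organisation_id = o.id
        JOIN person AS p ON p.id = m.person_id
        WHERE p.name = ?
        ORDER BY o.name
        """,
        (person_name,),
    ).fetchall()
    return [row[0] for row in rows]


def members_of(conn, organisation_name):
    """Return the names of the people linked to the named organisation."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM person AS p
        JOIN membership AS m ON m.person_id = p.id
        JOIN organisation AS o ON o.id = m.organisation_id
        WHERE o.name = ?
        ORDER BY p.name
        """,
        (organisation_name,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_example_db()
    print(organisations_for(conn, "Person A"))
    print(members_of(conn, "Society X"))
```

The point of the sketch is that neither entity table needs to know about the other: the relationship lives entirely in the rows of the linking table, so the same structure can be queried in either direction.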
I had not previously used this method for creating databases, but it makes a lot of sense and helps you to think about the overarching picture, as well as whether the data structure will provide you with what you need before the database is even populated. However (rather predictably), part of me did also think that if you have a complex entity-relationship diagram, then Linked Data might be a better way of structuring that information than shoehorning it into a relational database structure. Obviously, as a Linked Data advocate, this is exactly what I would think – part of my future research will involve exploring these other methods and seeing what advantages a Linked Data approach might have, weighed against the effort involved in producing the infrastructure and interface required to make the resulting resource usable.
Generally, I found this to be a really interesting workshop as well as an effective approach to teaching about databases. There was clearly no scope in an already-packed day for discussing the advantages and disadvantages of databases over other data structures (e.g. Linked Data), but it certainly gave me a lot to think about for my own research. I will need to consider the strengths of each approach to structuring data, to see which of these the different approaches have in common, and to ascertain which might work best in different contexts. At the moment, it seems to me that Linked Data shares all of the advantages of a relational database in terms of connecting entities and describing their relationships, but offers more besides, particularly in terms of linking to external resources and ontologies. It also transcends the more rigid structures of a relational database, making it potentially more futureproof, as well as platform-agnostic. However, I also understand that this is probably quite a blinkered way of looking at things and that I will need to question these views by investigating these other approaches in order to reach more robust conclusions than ‘I think this makes sense’.
Thank you to Sharon Webb for leading the workshop, and to the Arts and Humanities in the Digital Age team for organising – it has certainly given me some food for thought!