The Human Element

Posted by Nathan Dize on Thursday, October 25, 2018 in Uncategorized.

The Text Encoding Initiative (TEI) has multiple purposes, but perhaps its foremost and most recognizable one is the digital preservation of texts. Today I’ll be writing about a common argument I hear against TEI: “The texts are available on Google Books! Why should we go through the labor-intensive process of encoding them in TEI when their page images are already digitized?”

I’m glad you asked! The Google Books project is an amazing undertaking, one which should be applauded. I can sit at my desk and look at a copy of a 19th-century book at any time. I can also search this book, because Google Books converts its page images into plain text using Optical Character Recognition (OCR). But unfortunately, OCR can be imperfect. “Dirty OCR” exists, and it can impede searchability: a word that was misrecognized will not turn up in a search. The older the book, the less modern the typeface, and the more damaged the type, the more likely it is that the OCR will be compromised. Overall, it’s an undeniably helpful technology, but computers are still no match for skilled human readers and editors. For this reason, you shouldn’t simply take an OCRed text and convert it into a TEI document by, for instance, adding a TEI Header and some paragraph tags.

However, it should be stated that any decisions you make in order to represent the characteristics of texts in TEI already move you closer to creating usable TEI. (What do I mean by “usable TEI”? I’ll get to this in a future blog post, but to be brief, I mean TEI that has use value to scholars beyond preservation.) To put a text into TEI, you have to think carefully about its qualities, divisions, and idiosyncrasies and decide how best to represent them. You become an editor–a maker of decisions about texts–whether you readily think of yourself as one or not. (And the TEI Header forces you to identify yourself as such a creator of meaning, too.) In this way, even the dirtiest of OCR is better off as TEI than just as plain text.*
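To make the point about the TEI Header concrete, here is a minimal sketch of a TEI document in which the encoder is named in a responsibility statement. Every title, name, and description in it is a placeholder assumed for illustration, not taken from a real project, and a real header would usually record much more.

```xml
<!-- Minimal sketch; all titles, names, and descriptions are placeholders -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Placeholder Title: A Digital Edition</title>
        <!-- The responsibility statement names the person who made the editorial decisions -->
        <respStmt>
          <resp>Transcribed and encoded by</resp>
          <name>Placeholder Editor</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <p>Placeholder publication details.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Placeholder description of the printed source that was transcribed.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Placeholder transcription, checked against the page images.</p>
    </body>
  </text>
</TEI>
```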

Once you’re wearing your editor hat, though, it makes sense to do more than transcribe the text and represent its most basic characteristics. As a matter of best practice, a TEI document featuring poetry should provide not simply line breaks between lines, but markup indicating that there are line groups (<lg>), and more specifically that the line groups are quatrains (@type="quatrain"). You can see how far we’ve moved already from plain-text OCR, just with some basic tagging.
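As a rough sketch, a quatrain encoded this way might look like the following. The verse lines are placeholders rather than a real poem; only the <lg> element, its type attribute, and the <l> element for each verse line are the point here.

```xml
<!-- Illustrative fragment; the verse lines are placeholders -->
<lg type="quatrain">
  <l>First line of the stanza</l>
  <l>Second line of the stanza</l>
  <l>Third line of the stanza</l>
  <l>Fourth line of the stanza</l>
</lg>
```

Plain text can only suggest a stanza with blank lines. Here the stanza and its form are stated explicitly, so a person or a program can find every quatrain in the document.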

A greater time investment with different goals is to create an actual critical edition by using the critical apparatus elements of TEI. You can provide editorial footnotes, identify variant readings, identify people and places referred to in the text, and much more. (Of course, some of these things require that you be something of an authority on the material you’re working with.) In some ways, creating a critical edition is simply taking digital representation and preservation one step further by adding more information that you have made decisions about.
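The fragment below is one hedged sketch of what that can look like for a single verse line. The wording, the witness sigla (#A, #B), and the identifiers (#person1, #place1, #ed) are all hypothetical and would need to be defined elsewhere in a real edition; the elements themselves (<app>, <lem>, <rdg>, <persName>, <placeName>, <note>) are standard TEI.

```xml
<!-- Illustrative fragment; text, witnesses, and identifiers are hypothetical -->
<l>
  A placeholder line that mentions
  <persName ref="#person1">a person</persName> in
  <placeName ref="#place1">a place</placeName>,
  <!-- One point of variation between two witnesses of the text -->
  <app>
    <lem wit="#A">the reading in witness A</lem>
    <rdg wit="#B">the reading in witness B</rdg>
  </app>
  <note type="editorial" resp="#ed">A placeholder editorial note.</note>
</l>
```

Each of these tags records a judgment: which reading to prefer, who is being referred to, what deserves a note.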

All of this is preliminary to the process of actually using TEI for research purposes, which I promise I’ll get into in a future post. But for now, it should be clearer why TEI-encoded documents are more valuable than Google Books scans: in short, because of the human element (no pun intended!). Preserving texts should be about more than simply preserving raw data, and that holds for the physical book itself as much as for the text it contains. We don’t simply preserve old books by putting them on a shelf: we do binding repairs and build protective casing. In the same way, digital preservation should include an element of intervention that helps us understand the text better, even if that intervention is as basic as proper transcription.

*Please, do not just use unchecked OCR as the basis for your TEI document!

Tags: Digital Editions, Humans and Computers, OCR, TEI