Text Data Processing for Humanists (In Person, East Campus)
REGISTRATION: Click Here
Humanities researchers can amass a considerable number of primary and secondary text-based sources. These may include scans of archival documents such as manuscripts, newspapers, books, and other materials, as well as scans of varying quality of secondary sources on loan from their own or other libraries. While close reading of this material is key for many humanities researchers, computation can also help them make use of so much data: once computational tools have transcribed handwritten and printed text, scholars can query the resulting text data to find information quickly.
These transcription processes, optical character recognition (OCR) for printed text and handwritten text recognition (HTR) for handwritten text, have improved significantly in recent years with advances in machine learning and generative artificial intelligence. In this workshop, we will examine how these technologies work, practice using several tools for OCR and HTR, and consider the opportunities and challenges that can arise when applying them to different page layouts, languages, and scripts. Participants are encouraged to bring a laptop.
By the end of this workshop, you will be able to
- describe how OCR and HTR work in general terms;
- identify possible opportunities and challenges when applying OCR and HTR technologies to different page layouts, languages, and scripts;
- implement several OCR and HTR technologies in your research; and
- assess accuracy, clean up processed text, and document workflows for transparency.
This workshop will be facilitated by Hannah Jacobs, Digital Humanities Consultant with Duke Libraries.
Location: East Campus Music Library Seminar Room
Participation: General discussion, structured activity, and time for questions.
Related LibGuide: Digital Humanities by Hannah Jacobs
Attending this event fulfills the RCR-200 requirement for faculty and staff and is eligible for 714 RCR credit for graduate students. To receive credit, participants must attend for 60 minutes and participate in discussion.