I teach machines to read old and non-European documents: the handwriting and print that standard text recognition was never built for.
Most text recognition systems are designed around modern conventions. Historical material rarely conforms: spelling, layout, and scribal practice vary enormously, and for most collections training data is sparse. My work redesigns recognition to cope with that scarcity and messiness, whether the material is Arabic manuscripts, medieval Hebrew, Chinese inscriptions, or whatever else a historian brings along.
I also build the infrastructure that makes these methods useful to scholars, chiefly kraken and the eScriptorium platform built on top of it. I try to keep these tools frugal and open: models that run on a laptop rather than a data centre, shared datasets, and common transcription norms that let scholars build on each other’s work instead of starting from scratch.