I teach machines to read old and non-European documents: the handwriting and print that standard text recognition was never built for.

Most text recognition systems are designed around modern conventions. Historical material rarely conforms: spelling, layout, and scribal practice vary enormously, and for most collections training data is sparse. My work redesigns recognition to cope with that scarcity and messiness, whether in Arabic manuscripts, medieval Hebrew, Chinese inscriptions, or whatever else a historian brings along.

I also build the infrastructure that makes these methods useful to scholars, chiefly kraken and the eScriptorium platform built on top of it. I try to keep these tools frugal and open: models that run on a laptop rather than a data centre, shared datasets, and common transcription norms that let scholars build on each other’s work instead of starting from scratch.

Current projects

  • BasHTR: Building generalized transcription guidelines and training data for Arabic-script text recognition.
  • ATRIUM: A European research infrastructure bridging the arts, humanities, archaeology, and language technologies to widen access to digital research tools and data.
  • MiDRASH: Reconstructing medieval Jewish book culture by making Hebrew, Aramaic, and Judeo-Arabic manuscripts accessible and searchable at scale.
  • Back in Time: Combining AI, cryptography, and history to build tools for deciphering encrypted historical documents.

Selected work

Software & infrastructure

  • kraken: Trainable, script-agnostic text-recognition engine for historical and non-Latin documents.
  • eScriptorium: Collaborative environment for transcribing manuscript and print collections, built on kraken.
  • party: Lightweight transformer model for accurate recognition at a fraction of the usual inference cost.
  • Orli: Ordered Regression of Lines, a learnable, joint layout-analysis and reading-order system.
  • HTRMoPo: Open repository for publishing, citing, and reusing text-recognition models via Zenodo.

Selected publications