Benjamin Kiessling

I teach machines to read old and non-European documents: the handwriting and print that standard text recognition was never built for.

Most text-recognition algorithms and systems are built around modern conventions. Historical material rarely conforms: spelling, layout, and scribal practice vary enormously, and for most collections training data is sparse. My work redesigns recognition to cope with that scarcity and messiness, whether the material is Arabic manuscripts, medieval Hebrew, Chinese inscriptions, or whatever else a historian brings along.

I also build the infrastructure that makes these methods useful to scholars, chiefly kraken and the eScriptorium platform built on top of it. I try to keep these tools frugal and open: models that run on a laptop rather than a data centre, shared datasets, and common transcription norms that let scholars build on each other’s work instead of starting from scratch.

Current projects

BasHTR: Building generalized transcription guidelines and training data for Arabic-script text recognition.
ATRIUM: A European research infrastructure bridging the arts, humanities, archaeology, and language technologies to widen access to digital research tools and data.
MiDRASH: Reconstructing medieval Jewish book culture by making Hebrew, Aramaic, and Judeo-Arabic manuscripts accessible and searchable at scale.
Back in Time: Combining AI, cryptography, and history to build tools for deciphering encrypted historical documents.

Selected work

Software & infrastructure

kraken: Trainable, script-agnostic text-recognition engine for historical and non-Latin documents.
eScriptorium: Collaborative environment for transcribing manuscript and print collections, built on kraken.
party: Lightweight transformer model for accurate recognition at a fraction of the usual inference cost.
Orli: Ordered Regression of Lines, a learnable, joint layout-analysis and reading-order system.
HTRMoPo: Open repository for publishing, citing, and reusing text-recognition models via Zenodo.

Selected publications

ICDAR 2026 Competition on Multilingual Medieval Handwriting Recognition. ICDAR, 2026
Transcription Guidelines for Generalized Automatic Text Recognition. 2025
CATMuS Medieval: A Multilingual Large-Scale Cross-Century Dataset in Latin Script for Handwritten Text Recognition and Beyond. ICDAR, 2024
Sharing Data for Handwritten Text Recognition. Digital Humanities in Practice (Routledge), 2024
A Modular Region and Text Line Layout Analysis System. ICFHR, 2020
BADAM: A Public Dataset for Baseline Detection in Arabic-Script Manuscripts. HIP @ ICDAR, 2019