TrOCR_Fr - Digitization Pipeline for French Handwritten Archives (GitHub Repository)

An open-source pipeline for digitizing French handwritten archives, developed with Arnault Gombert. It is built around three core steps:

  • Layout parsing: Raw document images are first segmented into individual lines of text.
  • OCR: Each line image is then transcribed using a TrOCR model fine-tuned specifically on French handwritten text.
  • Named Entity Recognition: Finally, an NER module extracts structured information from the transcribed text, such as names, places, and dates.
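
The three steps above can be sketched as a simple chain of stages. This is a minimal illustration only, not the repository's actual code: `run_pipeline`, `PageResult`, and the three stand-in stage functions are hypothetical names, and the stand-ins operate on bytes and a tiny lookup table so the example runs without any model. In practice each stage would be replaced by a real layout parser, a fine-tuned TrOCR model, and an NER model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# NOTE: every name here is hypothetical and chosen for illustration;
# none of it is taken from the TrOCR_Fr repository.
Entity = Tuple[str, str]  # (entity text, entity label)


@dataclass
class PageResult:
    lines: List[str] = field(default_factory=list)       # one transcription per line
    entities: List[Entity] = field(default_factory=list)  # entities found in the page


def run_pipeline(
    page: bytes,
    segment_lines: Callable[[bytes], List[bytes]],
    transcribe_line: Callable[[bytes], str],
    extract_entities: Callable[[str], List[Entity]],
) -> PageResult:
    """Chain the three stages: layout parsing, then OCR, then NER."""
    line_images = segment_lines(page)                      # step 1: page -> line images
    lines = [transcribe_line(img) for img in line_images]  # step 2: line image -> text
    entities = extract_entities("\n".join(lines))          # step 3: text -> entities
    return PageResult(lines=lines, entities=entities)


# Stand-in stages (assumptions): a "page" is plain bytes with one line per row,
# so no image processing or model inference happens here.
def fake_segment(page: bytes) -> List[bytes]:
    return page.split(b"\n")


def fake_transcribe(line_image: bytes) -> str:
    return line_image.decode("utf-8")


def fake_ner(text: str) -> List[Entity]:
    known = {"Paris": "LOC", "1872": "DATE"}  # toy lookup table, not a real model
    return [(tok, known[tok]) for tok in text.split() if tok in known]


result = run_pipeline(
    b"Mariage de Jean Martin\nParis le 3 mars 1872",
    fake_segment,
    fake_transcribe,
    fake_ner,
)
print(result.lines)
print(result.entities)
```

Passing the stages in as callables keeps the orchestration independent of any particular model, so a different segmenter or recognizer can be swapped in without touching `run_pipeline`.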