TrOCR_Fr - Digitization Pipeline for French Handwritten Archives (GitHub Repository)
An open-source pipeline for digitizing French handwritten archives, developed with Arnault Gombert. Built around three core steps:
- Layout parsing: Raw document images are segmented into individual lines of text before anything else happens.
- OCR: Each line image is then transcribed using a TrOCR model fine-tuned specifically on French handwritten text.
- Named Entity Recognition: Finally, a NER module extracts structured information from the transcribed content, such as names, places, and dates.
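
The way the three steps chain together can be sketched as below. This is only an illustration under stated assumptions, not the repository's actual code: every name here (`run_pipeline`, `segment_by_blank_rows`, `toy_ner`, `PageResult`) is hypothetical, the segmenter is a toy that splits on blank pixel rows, the transcriber is a canned stand-in for the fine-tuned TrOCR model, and the NER step is a simple gazetteer lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# All names in this sketch are hypothetical; none come from the repository.
Image = List[List[int]]      # grayscale image as rows of pixels, 0 = background
Entity = Tuple[str, str]     # (surface text, label), e.g. ("Lyon", "LOC")


@dataclass
class PageResult:
    lines: List[str] = field(default_factory=list)
    entities: List[Entity] = field(default_factory=list)


def run_pipeline(
    page: Image,
    segment: Callable[[Image], List[Image]],
    transcribe: Callable[[Image], str],
    extract: Callable[[str], List[Entity]],
) -> PageResult:
    """Chain the three steps: layout parsing, then OCR, then NER."""
    line_images = segment(page)                       # 1. layout parsing
    lines = [transcribe(img) for img in line_images]  # 2. OCR, one line at a time
    entities = extract("\n".join(lines))              # 3. NER on the full transcript
    return PageResult(lines=lines, entities=entities)


def segment_by_blank_rows(page: Image) -> List[Image]:
    """Toy layout parser: consecutive rows containing ink form one text line."""
    lines: List[Image] = []
    current: Image = []
    for row in page:
        if any(row):              # at least one non-background pixel
            current.append(row)
        elif current:             # a blank row closes the current line
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


def toy_ner(text: str) -> List[Entity]:
    """Toy NER: gazetteer lookup standing in for a trained model."""
    gazetteer = {"Jean Dupont": "PER", "Lyon": "LOC"}
    return [(name, label) for name, label in gazetteer.items() if name in text]


# Tiny demo: two "lines" of ink separated by a blank row.
page = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]
canned = iter(["Jean Dupont, né à Lyon", "le 3 mars 1887"])  # stand-in for TrOCR output
result = run_pipeline(page, segment_by_blank_rows, lambda img: next(canned), toy_ner)
print(result.lines)
print(result.entities)
```

Passing each stage in as a callable keeps the steps independent, so a real line segmenter, a TrOCR-based transcriber, or a trained NER model could be swapped in without touching `run_pipeline`.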

