TrOCR_Fr - Digitization Pipeline for French Handwritten Archives (GitHub Repository)

I developed an open-source pipeline for digitizing French handwritten archives. It performs three tasks in sequence:

  • Layout Parsing: In the first stage, full-page images are segmented into smaller units, each containing a single line of text.
  • Optical Character Recognition (OCR) Module: The OCR module then transcribes each line image into text. For this purpose, a baseline TrOCR model was fine-tuned specifically for French handwritten text.
  • Named Entity Recognition (NER) Module: In the final stage, the NER module extracts key information from the transcribed archival content.
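
To illustrate how the three stages fit together, here is a minimal sketch of the data flow. It is not the repository's code: every name in it (`segment_lines`, `run_pipeline`, `LineResult`) is hypothetical. The layout step uses a simple horizontal projection profile on a binary image as a stand-in for the actual layout parser, and the OCR and NER stages are passed in as plain callables, which is where a fine-tuned TrOCR model and an NER model would be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a page is a binary image stored as rows of 0/1 (1 = ink).
Image = List[List[int]]
Entity = Tuple[str, str]  # (entity text, label)


@dataclass
class LineResult:
    rows: Tuple[int, int]   # (top, bottom) row span of the line, bottom exclusive
    text: str               # transcription returned by the OCR stage
    entities: List[Entity]  # entities returned by the NER stage


def segment_lines(page: Image) -> List[Tuple[int, int]]:
    """Stage 1 stand-in: find text lines as runs of consecutive rows with ink."""
    spans: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(page):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                  # a new line begins
        elif not has_ink and start is not None:
            spans.append((start, y))   # a blank row closes the line
            start = None
    if start is not None:              # line touching the bottom edge
        spans.append((start, len(page)))
    return spans


def run_pipeline(
    page: Image,
    ocr: Callable[[Image], str],
    ner: Callable[[str], List[Entity]],
) -> List[LineResult]:
    """Chain the stages: layout parsing, then OCR, then NER, line by line."""
    results = []
    for top, bottom in segment_lines(page):
        line_image = page[top:bottom]  # crop one line of text
        text = ocr(line_image)         # stage 2: transcription
        results.append(LineResult((top, bottom), text, ner(text)))  # stage 3
    return results


if __name__ == "__main__":
    demo_page = [
        [0, 0, 0, 0],
        [1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 1],
    ]
    # Placeholder stages, used only to show the data flow.
    placeholder_ocr = lambda img: f"<line of {len(img)} rows>"
    placeholder_ner = lambda text: []
    for result in run_pipeline(demo_page, placeholder_ocr, placeholder_ner):
        print(result)
```

Passing the OCR and NER stages as callables keeps the sketch independent of any particular model, so each stage can be swapped or tested on its own.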