ARCHITECTURE OF A SEMI-AUTOMATED ANNOTATION SYSTEM FOR MULTILINGUAL ARCHIVAL HANDWRITTEN TEXTS
Abstract
The developed system architecture enables the creation of datasets for further processing with machine learning and deep learning techniques. This approach plays a key role in addressing the challenges of automated recognition of historical handwritten documents, particularly in complex multilingual and multi-script environments. A significant portion of Ukraine’s archival heritage, especially documents dating from the 14th to the 19th centuries, contains texts written in various languages, including Ukrainian, Polish, Russian, and Ottoman Turkish, and in different scripts such as Cyrillic, Latin, and Arabic.
Traditional Optical Character Recognition (OCR) systems are typically designed for printed texts and are limited in their ability to handle the variability and noise present in historical manuscripts. Furthermore, they often lack support for mixed-language documents and rare historical scripts, which makes them unsuitable for large-scale archival digitization projects. In contrast, the proposed system architecture not only allows for the semi-automated labeling of individual characters but also incorporates user interaction, validation mechanisms, and multilingual capabilities. These features improve the quality of the labeled data and ensure its suitability for downstream machine learning tasks.
The resulting datasets can be used to train modern recognition models, including convolutional neural networks (CNNs) and transformer-based architectures, which have demonstrated high effectiveness in visual and sequence processing tasks. By generating high-quality annotated training samples, the system contributes to the development of robust handwriting recognition solutions that can adapt to historical variation in script style, ink degradation, and complex page layouts.
Moreover, the architecture supports iterative model refinement through human-in-the-loop strategies, where user feedback is incorporated to improve recognition accuracy over time. This is particularly important in the digital humanities domain, where expert validation and domain-specific knowledge play a critical role in ensuring the reliability of computational tools.
Ultimately, the proposed system facilitates the preservation, accessibility, and computational analysis of cultural heritage documents, thereby supporting historians, linguists, and archivists in their research efforts.
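The semi-automated labeling and human-in-the-loop validation described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the record schema (`CharAnnotation`), the helper names, and the 0.9 confidence threshold are all assumptions introduced here for illustration. The idea shown is that automatically proposed character labels with low confidence are routed to an expert, and only validated or high-confidence labels are exported as training samples.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CharAnnotation:
    """One labeled character region on a scanned page (hypothetical schema)."""
    page_id: str
    bbox: Tuple[int, int, int, int]   # x, y, width, height in pixels
    script: str                       # e.g. "Cyrillic", "Latin", "Arabic"
    predicted: str                    # label proposed by the automatic step
    confidence: float                 # confidence of the proposal, in [0, 1]
    validated: Optional[str] = None   # label confirmed or corrected by an expert


def needs_review(ann: CharAnnotation, threshold: float = 0.9) -> bool:
    """Unvalidated, low-confidence proposals go to the expert queue.

    The threshold value is an assumption, not taken from the paper.
    """
    return ann.validated is None and ann.confidence < threshold


def apply_feedback(ann: CharAnnotation, expert_label: str) -> CharAnnotation:
    """Record the expert's decision (confirmation or correction)."""
    ann.validated = expert_label
    return ann


def export_training_samples(
    annotations: List[CharAnnotation], threshold: float = 0.9
) -> List[Tuple[str, Tuple[int, int, int, int], str, str]]:
    """Keep expert-validated labels, plus high-confidence proposals."""
    samples = []
    for ann in annotations:
        if ann.validated is not None:
            label = ann.validated
        elif ann.confidence >= threshold:
            label = ann.predicted
        else:
            continue  # still waiting for expert review
        samples.append((ann.page_id, ann.bbox, ann.script, label))
    return samples
```

In such a scheme, the exported samples would feed the next training round, and the retrained model would in turn produce new proposals for review, which is one way the iterative refinement loop could be closed.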