METHOD FOR CLASSIFYING DOCUMENTS ACCORDING TO THE COMPLEXITY OF DATA EXTRACTION BY LARGE LANGUAGE MODELS
Abstract
This article addresses the problem of optimizing automated data extraction from business documents with large language models (LLMs). Extraction quality varies significantly with a document's structural and semantic characteristics, and without a method for predicting that quality, computational resources are used inefficiently. Existing research on document classification focuses on thematic categorization rather than on the technical complexity of data extraction. To close this gap, the article proposes a method for classifying documents by how difficult they are for language models to process. Each document is annotated with binary features of structural and semantic complexity. Three language models then extract data from each document in zero-shot mode, and an integral quality metric is computed as the harmonic mean of precision and recall (the F1 score). Complexity classes are formed from these metrics, and classifiers are built with multiclass logistic regression and validated with stratified cross-validation. The key feature of the method is that it predicts the expected quality of document processing automatically from the document's formalized characteristics. The method was tested on a corpus of synthetic documents with varied complexity characteristics, and a three-level complexity classifier was built for each of the three LLMs. Analysis of the weight coefficients identified the complexity factors with the greatest negative impact on extraction quality. The scientific novelty lies in creating the first empirically validated method for classifying documents in which the target variable is the expected quality of data extraction by language models. The practical value lies in enabling automated decisions about processing strategy in production systems. The results provide a methodological foundation for developing intelligent document processing systems and for optimizing the use of computational resources.
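As an illustration only, the quality metric and the formation of complexity classes might look like the sketch below. It is not the authors' implementation. The exact-match comparison of field values, the thresholds 0.6 and 0.9, the class labels, and the invoice field names are all assumptions introduced for the example; the abstract states only that the metric is the harmonic mean of precision and recall and that three complexity levels are used.

```python
from typing import Dict


def f1_score(extracted: Dict[str, str], expected: Dict[str, str]) -> float:
    """Harmonic mean of precision and recall over extracted field values."""
    # Assumption: a field is correct only if its key is expected
    # and its value matches the reference value exactly.
    correct = sum(1 for key, value in extracted.items() if expected.get(key) == value)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def complexity_class(f1: float, low: float = 0.6, high: float = 0.9) -> str:
    """Map an F1 score to one of three complexity levels.

    The thresholds `low` and `high` are hypothetical; the source does not
    state how class boundaries were chosen.
    """
    if f1 >= high:
        return "easy"
    if f1 >= low:
        return "medium"
    return "hard"


# Hypothetical reference fields and one model's zero-shot output.
expected = {"invoice_no": "A-17", "total": "120.00", "date": "2025-01-03", "vendor": "Acme"}
extracted = {"invoice_no": "A-17", "total": "120.00", "date": "2025-03-01"}

score = f1_score(extracted, expected)   # 2 of 3 extracted fields are correct, 2 of 4 expected fields are found
label = complexity_class(score)
```

In the method described above, a label of this kind would serve as the target variable, and the document's binary complexity features would serve as inputs to a multiclass logistic regression fitted separately for each model.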

This work is licensed under a Creative Commons Attribution 4.0 International License.