TRAINING METHOD ON IMBALANCED DATA WITH ADAPTIVE WEIGHTING FOR SOFTWARE CODE VULNERABILITY DETECTION

O. V. Tazetdinov; V. H. Babenko

doi:10.32782/2521-6643-2026-2-72.23

O. V. Tazetdinov, Cherkasy State Technological University, https://orcid.org/0000-0003-4387-0500
V. H. Babenko, Cherkasy State Technological University, https://orcid.org/0000-0003-2039-2841

Keywords: software code vulnerabilities, imbalanced data, adaptive weighting, neural networks, curriculum learning, Focal Loss, mini-batch sampling, cybersecurity

Abstract

This article develops a method for training neural networks on imbalanced data for automatic detection of software code vulnerabilities. In real-world software projects, vulnerable code typically makes up no more than 5–10 % of the codebase, which produces a significant class imbalance, with majority-to-minority ratios ranging from 10:1 to 100:1. Under such imbalance, gradients from the majority class dominate training and substantially reduce a model's ability to detect vulnerabilities, which is its primary objective. Existing approaches to imbalanced data are analyzed, including resampling methods (Random Oversampling, SMOTE, Random Undersampling), static class weighting, and Focal Loss, and are compared with respect to their applicability to source code analysis. The analysis shows that resampling methods do not work correctly with the discrete structures of program code, static weights fail to adapt as sample difficulty changes during training, and Focal Loss is sensitive to hyperparameters under extreme class imbalance. An adaptive weighting method is proposed that combines three components: (1) class weighting based on the inverse class frequency raised to a tunable power, which controls the strength of the correction; (2) dynamic assessment of sample difficulty from prediction uncertainty, measured as classification entropy; and (3) a curriculum learning component that introduces complex samples gradually through an exponential pacing function. In addition, an adaptive mini-batch sampling strategy is developed that adjusts class ratios dynamically as training progresses and guarantees that the minority class is represented in every mini-batch; this addresses the problem that, under standard random sampling, most batches contain no minority class examples.
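
The three weighting components named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the function names, the multiplicative way the components are combined, the form of the pacing function, and the parameters `alpha`, `beta`, `start`, and `rate` are hypothetical and are not the authors' formalization.

```python
import math


def class_weight(class_count, total_count, alpha=0.5):
    """Inverse class frequency raised to a tunable power alpha.

    alpha = 0 disables the correction; alpha = 1 is full inverse-frequency weighting.
    """
    return (total_count / class_count) ** alpha


def normalized_entropy(probs):
    """Shannon entropy of a predicted class distribution, scaled to [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def pacing(progress, start=0.2, rate=5.0):
    """Assumed exponential pacing: equals `start` at progress 0 and approaches 1."""
    progress = min(max(progress, 0.0), 1.0)
    return 1.0 - (1.0 - start) * math.exp(-rate * progress)


def raw_weight(label, probs, class_counts, progress, alpha=0.5, beta=1.0):
    """Combine the three components into one unnormalized sample weight.

    The multiplicative combination is an assumption of this sketch.
    """
    total = sum(class_counts.values())
    w_class = class_weight(class_counts[label], total, alpha)
    difficulty = normalized_entropy(probs)
    # The pacing value scales how strongly difficult samples are emphasized,
    # so their influence grows gradually over the course of training.
    return w_class * (1.0 + beta * pacing(progress) * difficulty)


if __name__ == "__main__":
    counts = {0: 9500, 1: 500}  # hypothetical 19:1 imbalance
    for progress in (0.0, 0.5, 1.0):
        easy = raw_weight(1, [0.05, 0.95], counts, progress)
        hard = raw_weight(1, [0.45, 0.55], counts, progress)
        print(f"progress={progress:.1f} easy={easy:.3f} hard={hard:.3f}")
```

In this sketch an uncertain prediction (high entropy) always receives a larger weight than a confident one of the same class, and the gap widens as `progress` moves from 0 to 1. Other readings of the curriculum component, for example excluding hard samples entirely in early epochs, are equally compatible with the abstract.
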
All components of the method are formalized mathematically, including weight normalization that ensures a unit mean weight and partial compensation of sampling bias through a tunable balance parameter. The results can be applied to improve the effectiveness of automated software security analysis systems.
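
The sampling side of the method can be sketched in the same hedged way. If minority examples make up 1 % of the data and batches of 32 are drawn uniformly, the chance that a batch contains no minority example is 0.99^32, roughly 0.72, which is the problem the adaptive sampler is meant to remove. The code below is a minimal sketch under stated assumptions: the linear schedule in `minority_share`, sampling with replacement, the power-law form of `bias_correction`, and the parameter `gamma` (standing in for the tunable balance parameter) are hypothetical choices, not the authors' definitions.

```python
import random


def prob_no_minority(minority_fraction, batch_size):
    """Chance that a uniformly drawn batch contains no minority example."""
    return (1.0 - minority_fraction) ** batch_size


def minority_share(progress, start=0.5, end=0.2):
    """Hypothetical schedule: target minority share moves linearly from start to end."""
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress


def sample_batch(minority_idx, majority_idx, batch_size, progress, rng):
    """Draw a batch (with replacement) that always holds a minority example."""
    n_min = round(minority_share(progress) * batch_size)
    n_min = min(max(n_min, 1), batch_size - 1)  # at least one example of each class
    batch = rng.choices(minority_idx, k=n_min)
    batch += rng.choices(majority_idx, k=batch_size - n_min)
    rng.shuffle(batch)
    return batch


def bias_correction(true_fraction, sampled_fraction, gamma=0.5):
    """Assumed partial importance correction for one class.

    gamma = 0 ignores the sampling bias; gamma = 1 undoes it completely.
    """
    return (true_fraction / sampled_fraction) ** gamma


def normalize_unit_mean(weights):
    """Rescale weights so that their mean is exactly one."""
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    minority = list(range(10))        # hypothetical minority indices
    majority = list(range(10, 1000))  # hypothetical majority indices
    batch = sample_batch(minority, majority, 32, progress=0.0, rng=rng)
    print(sum(i < 10 for i in batch), "minority examples in the batch")
    print(f"P(no minority) = {prob_no_minority(0.01, 32):.2f}")
```

Normalizing to a unit mean keeps the overall loss scale, and therefore the effective learning rate, comparable across batches whatever class mix the sampler produced.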

TRAINING METHOD ON IMBALANCED DATA WITH ADAPTIVE WEIGHTING FOR SOFTWARE CODE VULNERABILITY DETECTION
