LEXICAL CONVOLUTION IN ANALYZING THE SIMILARITY OF PROGRAM TEXTS

Keywords: copyright, similarity of program code, lexical convolution, token.

Abstract

This article addresses the problem of copyright protection for the texts of computer programs. Although legislation recognizes both the source code and the object code of computer programs as subject to copyright protection, its practical implementation is imperfect. The reason is historical: the problem of protecting the authorship of literary texts arose first, and the same approach was later extended to the texts of computer programs. As a result, program code is treated as merely a kind of literary text, and the techniques proposed for analyzing its similarity are the same ones applied to literary texts. These techniques do not take into account the peculiarities of program texts, above all the grammatical rules by which program code is constructed. Unlike the grammar of literary texts, the syntax of a programming language is built on stricter rules, which are formalized and described using metalanguages. Consequently, every statement or instruction contains constant expressions that the compiler treats as standard tokens of the particular programming language. Their names and positions cannot be arbitrary, so they define, as it were, the lexical skeleton of the program. When writing program code, however, the author is free to choose names for certain components of the program: variables, labels, user-defined functions, and so on. These names are user tokens and are not treated as fixed components of a statement during compilation. They can easily be replaced in the source code without any change in the sequence of standard tokens. Such “cloning” of program code by dishonest users often goes unnoticed, because software tools for detecting text similarity give a significantly underestimated result, since they do not distinguish between standard and user tokens in the texts being compared.
The same flawed approach to program texts can also overestimate similarity, for the same reason. Examples given in the article demonstrate this. The article proposes an approach in which standard tokens are separated from user tokens in the texts of computer programs, so that the latter have much less influence on the result of the similarity check. This transformation, called lexical convolution, is demonstrated on the basic constructions of the C programming language and on a fragment of program code. The approach can also be extended to other programming languages.
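
As a rough illustration of the idea (not the article's own algorithm), the following Python sketch tokenizes a C fragment, keeps keywords, operators, and punctuation as standard tokens, and replaces every user-chosen identifier with a placeholder. The keyword subset, the placeholder names ID and NUM, the treatment of numeric literals, the function names, and the use of difflib.SequenceMatcher as the similarity measure are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher
from typing import List

# Assumption: a small subset of C keywords stands in for the full
# set of standard tokens of the language.
C_KEYWORDS = {
    "int", "char", "float", "double", "void", "if", "else", "for",
    "while", "do", "return", "break", "continue", "switch", "case",
    "default", "struct", "sizeof",
}

# Simplified tokenizer for illustration; it ignores comments,
# preprocessor directives, and many C operators.
TOKEN_RE = re.compile(r"""
    [A-Za-z_]\w*                                  # identifiers and keywords
  | \d+(?:\.\d+)?                                 # numeric literals
  | "(?:\\.|[^"\\])*"                             # string literals
  | ==|!=|<=|>=|&&|\|\||\+\+|--|\+=|-=|\*=|/=     # two-character operators
  | [^\s\w]                                       # any other single symbol
""", re.VERBOSE)


def tokenize(source: str) -> List[str]:
    """Split C source text into raw tokens."""
    return TOKEN_RE.findall(source)


def lexical_convolution(source: str) -> List[str]:
    """Keep standard tokens; collapse user tokens to placeholders.

    Hypothetical scheme: identifiers become "ID" and numeric
    literals become "NUM"; everything else is kept unchanged.
    """
    result = []
    for tok in tokenize(source):
        if tok in C_KEYWORDS:
            result.append(tok)        # standard token: keyword
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            result.append("ID")       # user token: a name chosen by the author
        elif tok[0].isdigit():
            result.append("NUM")      # assumption: literals are collapsed too
        else:
            result.append(tok)        # standard token: operator or punctuation
    return result


def similarity(a: List[str], b: List[str]) -> float:
    """Similarity of two token sequences in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    original = "int sum = 0; for (i = 0; i < n; i++) sum += a[i];"
    renamed = "int total = 0; for (k = 0; k < len; k++) total += data[k];"
    print("raw tokens:", similarity(tokenize(original), tokenize(renamed)))
    print("convolved: ", similarity(lexical_convolution(original),
                                    lexical_convolution(renamed)))
```

In this example the second fragment differs from the first only in its variable names. After convolution the two token sequences are identical, so the score is 1.0, while the comparison of raw tokens gives a lower value, which corresponds to the underestimation described above.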

Published
2023-07-14
How to Cite
Pavlov, V. G. (2023). LEXICAL CONVOLUTION IN ANALYZING THE SIMILARITY OF PROGRAM TEXTS. Systems and Technologies, 65(1), 53-59. https://doi.org/10.32782/2521-6643-2023.1-65.7
Section
COMPUTER ENGINEERING