DETECTION OF SOURCE CODE IN INTERNET TEXTS USING AUTOMATICALLY GENERATED MACHINE LEARNING MODELS

ABSTRACT

This paper presents the results of web scraping software that automatically classifies source code. The system was built for a discussion forum for software developers, to find fragments of source code that were published without being marked as code snippets. The analyzer uses a machine learning binary classification model to differentiate between programming language source code and highly technical text about software. The model was prepared using an AutoML subsystem, without human intervention or fine-tuning, and its accuracy on the described problem exceeds 95%. The analyzer based on the automatically generated model has been deployed, and after the first year of continuous operation its false positive rate is below 3%. A similar process may be introduced in document management within the software development process, where automatic tagging of and search for code or pseudo-code may be useful for archiving purposes.
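
The two figures quoted above, accuracy and false positive rate, can be illustrated with a minimal sketch. It assumes the positive class means "source code" and the negative class means "technical text". The function name `binary_metrics` and the label lists are hypothetical and are not taken from the paper or its evaluation data.

```python
from typing import Sequence, Tuple


def binary_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[float, float]:
    """Return (accuracy, false_positive_rate) for a binary classifier.

    Assumption: True means "source code", False means "technical text".
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    total = tp + tn + fp + fn
    # Accuracy: share of all posts that were classified correctly.
    accuracy = (tp + tn) / total if total else 0.0
    # False positive rate: share of plain-text posts wrongly flagged as code.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Hypothetical ground truth and predictions for eight forum posts.
y_true = [True, True, False, False, False, True, False, False]
y_pred = [True, False, False, True, False, True, False, False]

accuracy, fpr = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, false positive rate={fpr:.2f}")
```

The false positive rate is computed over the negative class only, so it can stay low even when plain-text posts greatly outnumber code posts.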

HOW TO CITE THIS PAPER

Badurowicz, M. (2022). Detection of source code in internet texts using automatically generated machine learning models. Applied Computer Science, 18(1), 89-98. https://doi.org/10.23743/acs-2022-07
Badurowicz, Marcin. "Detection of Source Code in Internet Texts Using Automatically Generated Machine Learning Models." Applied Computer Science 18, no. 1 (2022): 89-98.
M. Badurowicz, "Detection of source code in internet texts using automatically generated machine learning models," Applied Computer Science, vol. 18, no. 1, pp. 89-98, 2022, doi: 10.23743/acs-2022-07.
Badurowicz M. Detection of source code in internet texts using automatically generated machine learning models. Applied Computer Science. 2022;18(1):89-98.