Analysis of methods of classification of electronic messages based on neural network models

Authors

DOI:

https://doi.org/10.20535/2411-1031.2023.11.2.293797

Keywords:

message classification, neural networks, natural language processing, spam filtering, text vectorization, email classification, text analysis, model quality evaluation

Abstract

The article considers the creation of a mechanism for detecting and classifying messages and assesses how effectively different neural networks recognize and classify different types of electronic messages, including phishing attacks, spam, and legitimate messages. A preliminary analysis of incoming messages was performed, covering their headers, text, and other relevant attributes; for emails, for instance, these attributes could be the 'subject' and 'sender' of the message. Methods for preparing and processing data for neural network training were reviewed, including text vectorization, noise removal, and normalization. Messages were tokenized and converted into a numerical format, taking feature selection into account; for text messages, both tokenization and text vectorization are essential. The dataset was split into two parts before training: 80% for training the model and 20% for testing, that is, for evaluating its effectiveness. The class structure of the data, namely the non-uniform distribution of classes, was taken into account: spam occurs less frequently than legitimate messages, so class balancing techniques such as random deletion of redundant examples, upsampling, and subsampling were applied to ensure adequate model training. Network parameters were optimized by searching for the best settings of the neural networks, such as the number and size of layers, activation functions, learning rate, and other hyperparameters, to achieve the best performance.
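The data preparation steps named above (tokenization, vectorization, an 80/20 split, and class balancing by random deletion) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy messages, the bag-of-words vectorizer, and the choice to balance only the training part are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def split_80_20(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold out `test_fraction` of them for testing."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def undersample(samples, seed=0):
    """Randomly delete examples of larger classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced corpus: spam is rarer than legitimate ("ham") mail.
messages = [(f"meeting notes number {i}", "ham") for i in range(16)]
messages += [(f"win a free prize now {i}", "spam") for i in range(4)]

train, test = split_80_20(messages)
balanced_train = undersample(train)  # the test part keeps its natural class ratio

# Build the vocabulary from the training part only, then vectorize.
vocabulary = sorted({tok for text, _ in balanced_train for tok in tokenize(text)})
features = [bag_of_words(tokenize(text), vocabulary) for text, _ in balanced_train]
```

In this sketch the test part is left unbalanced so that evaluation reflects the class ratio a deployed filter would likely see; that is a design choice of the example, not something the abstract specifies.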
Effectiveness was assessed by comparing the results and performance of various neural-network-based classification methods using metrics such as precision and F1-score. The assessment determined how well the methods avoid misclassifications in which legitimate messages are mistakenly identified as spam, and vice versa. The methods' effectiveness in processing a large volume of messages in real time was also compared. Different neural network architectures were analyzed, and the analysis showed how effectively each of them can recognize and classify messages as spam.
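The metrics mentioned above can be made concrete with a short sketch. It treats "spam" as the positive class, so a false positive is a legitimate message flagged as spam and a false negative is spam that passes as legitimate. Recall is included because the F1-score is the harmonic mean of precision and recall. The function names and the sample labels are hypothetical and are not results from the article.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn) for the chosen positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # legitimate message mistakenly identified as spam
        elif truth == positive:
            fn += 1  # spam mistakenly identified as legitimate
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall, and F1, returning 0.0 where a ratio is undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for six messages, for illustration only.
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "ham", "spam"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

For these labels there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.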

Author Biographies

Volodymyr Onishchenko, Institute of Special Communications and Information Security, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

junior researcher

Anatolii Minochkin, Heroiv Krut Military Institute of Telecommunications and Informatization, Kyiv

doctor of technical sciences, professor, leading researcher

Published

2023-12-28

How to Cite

Onishchenko, V., & Minochkin, A. (2023). Analysis of methods of classification of electronic messages based on neural network models. Collection "Information Technology and Security", 11(2), 216–226. https://doi.org/10.20535/2411-1031.2023.11.2.293797

Section

ARTIFICIAL INTELLIGENCE IN THE CYBERSECURITY FIELD