Analysis of methods of classification of electronic messages based on neural network models
DOI:
https://doi.org/10.20535/2411-1031.2023.11.2.293797
Keywords:
message classification, neural networks, natural language processing, spam filtering, text vectorization, email classification, text analysis, model quality evaluation
Abstract
The article considers the creation of a mechanism for detecting and classifying messages and assesses how effectively different neural networks recognize and classify different types of electronic messages, including phishing attacks, spam, and legitimate messages. A preliminary analysis of incoming messages was performed, covering their headers, text, and other relevant attributes; for emails, such attributes include the 'subject' and 'sender' of the message. Methods for preparing and processing data for neural network training were reviewed, including text vectorization, noise removal, and normalization. Messages were tokenized and converted into a numerical format, with attention to feature selection; for text messages, both tokenization and text vectorization are essential. Before training, the dataset was split into two parts: 80% for training and 20% for testing. The training set is used to fit the model, and the test set is used to evaluate its effectiveness. The class structure of the data, namely the non-uniform distribution of classes, was taken into account: spam occurs less frequently than legitimate messages, so class balancing techniques such as random deletion of redundant examples (subsampling) and upsampling were applied to ensure adequate model training. Network parameters were optimized by searching for the best settings of the neural networks, such as the number and size of layers, activation functions, learning rate, and other hyperparameters, to achieve the best performance.
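The preparation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy messages, the function names (`tokenize`, `build_vocab`, `vectorize`, `stratified_split`, `oversample`), the bag-of-words representation, and the choice to stratify the split by class are all assumptions made for illustration, using only the Python standard library.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(tokens, vocab):
    """Bag-of-words count vector; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec


def group_by_label(samples):
    """Group (text, label) pairs by their label."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample[1], []).append(sample)
    return groups


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split so that each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for group in group_by_label(samples).values():
        group = group[:]                 # copy before shuffling
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def oversample(samples, seed=0):
    """Randomly duplicate minority-class samples until all classes match in size."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Toy data (made up): 40 legitimate messages and 10 spam messages.
data = [(f"meeting notes number {i}", "ham") for i in range(40)]
data += [(f"win a free prize now {i}", "spam") for i in range(10)]

train, test = stratified_split(data)          # 80% / 20%
balanced_train = oversample(train)            # balance the training part only
vocab = build_vocab([tokenize(text) for text, _ in balanced_train])
X_train = [vectorize(tokenize(text), vocab) for text, _ in balanced_train]
y_train = [label for _, label in balanced_train]
```

In this sketch only the training part is balanced, so the test part keeps the original class proportions and the evaluation reflects the real share of spam.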
The effectiveness was assessed by comparing the results and performance of various classification methods based on neural networks using metrics such as precision and F1-score. It was determined how well the methods can avoid misclassifications where legitimate messages are mistakenly identified as spam, and vice versa. A comparison of the methods' effectiveness in processing a large volume of messages in real time was conducted. An analysis of different architectures of neural network models was performed. Based on the analysis, it was revealed how effectively different neural network models can recognize and classify messages as spam.
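The metrics named above can be computed directly from the counts of correct and incorrect decisions. The following is a minimal sketch under stated assumptions: spam is treated as the positive class, the label lists are made up, and the helper names `confusion_counts` and `precision_recall_f1` are hypothetical. Recall is included because the F1-score is the harmonic mean of precision and recall.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1          # legitimate message flagged as spam
        elif truth == positive:
            fn += 1          # spam that slipped through as legitimate
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1; each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels for illustration only.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Here `fp` counts legitimate messages mistakenly identified as spam and `fn` counts the opposite error, the two kinds of misclassification the comparison is concerned with.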
License
Copyright (c) 2023 Collection "Information Technology and Security"
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish in this collection agree to the following terms:
- The authors retain authorship of their work and grant the collection the right of first publication of the work, which is licensed under the Creative Commons Attribution License; this license allows others to freely distribute the published work with mandatory attribution to the authors of the original work and to its first publication in this collection.
- The authors have the right to conclude an agreement on exclusive distribution of the work in the form in which it was published in this collection (for example, to deposit the work in an institutional digital repository or to publish it as part of a monograph), provided that the first publication of the work in this collection is referenced.
- The journal's policy allows and encourages authors to post the manuscript of the work on the Internet (for example, in repositories or on personal websites), both before submitting the manuscript to the editors and during its editorial processing, as this contributes to productive scientific discussion and has a positive effect on the efficiency and dynamics of citation of the published work (see The Effect of Open Access).