Abstract
In recent decades, email communication has grown markedly, bringing a concomitant surge in spam emails that congest networks and present security risks. This study introduces a spam detection method based on the Horse Herd Optimization Algorithm (HHOA), adapted for binary classification within a multi-objective framework. The method identifies essential features, minimizing redundancy and improving classification precision. The proposed HHOA attained an accuracy of 97.21% on the Kaggle email dataset, with a precision of 94.30%, a recall of 90.50%, and an F1-score of 92.80%. Compared with conventional techniques such as Support Vector Machine (93.89% accuracy), Random Forest (96.14% accuracy), and K-Nearest Neighbours (92.08% accuracy), HHOA achieved higher accuracy with reduced computational complexity. It also selected fewer features while maintaining high classification accuracy. The results underscore the efficacy of HHOA in spam identification and indicate its potential for practical email filtering systems.
0 Introduction
Undesirable email takes many forms, such as spam, junk email, and promotional or business-oriented messages, all of which users commonly do not want in their inboxes. Despite their distinctions, for the purposes of this study all such messages are categorized as spam emails[1]. Spam encompasses a broad range of inappropriate messages that are disseminated extensively across the Internet and carry no useful content for recipients. It appears in diverse formats across multiple platforms, including social media, websites, forums, instant messaging, and particularly email, where spamming has become especially common because email is so widely used. Spam refers to unsolicited messages sent, either explicitly or implicitly, by individuals who lack any prior relationship with the recipient[2]. While email is convenient and efficient for communication, it becomes a nuisance when exploited by marketers for product promotion and by scammers for deceptive purposes. The negative effects of spam go beyond wasting resources, time, and effort: spam worsens communication challenges and contributes to cybercrime, ultimately affecting the global economy and causing substantial financial losses for both businesses and individuals every year. Because of its cost-effectiveness and ease of use, email has become crucial for personal and professional correspondence as digital communication grows, and spam has proliferated along with it, creating substantial problems for users and companies[3-4]. Spam emails consume network bandwidth, storage space, and the time spent removing them, and they expose recipients to phishing, malware, and inappropriate content. Precise and effective spam detection technologies are therefore essential to reduce these dangers.
Many spam detection methods exist; however, achieving high accuracy with few false positives remains difficult. K-Nearest Neighbour (KNN) [5], Multilayer Perceptron (MLP) [6], Support Vector Machine (SVM), and Naive Bayesian classifiers suffer from limited generalization, high computational cost, and poor performance on massive email data. Feature selection is crucial to classification accuracy, but current methods lack practical optimization algorithms. Existing research relies on standard classification methods or single-objective optimization, which fail to resolve the trade-off between the number of selected features and classification accuracy. In addition, spam detection using binary metaheuristic algorithms has received little attention. Our work aims to create a new spam detection system that uses a Binary Horse Herd Optimization Algorithm (BHHOA) to optimize feature selection, to apply a multi-objective contrastive transformation technique, and to assess the proposed algorithm against classical classifiers on benchmark datasets.
The main contributions of our work are: 1) a modified binary version of the Horse Herd Optimization Algorithm (HHOA) for spam email detection; 2) a contrastive multi-objective framework for better feature selection; and 3) a comprehensive performance analysis on the Kaggle email dataset, revealing enhanced classification accuracy and reduced computational complexity.
The remainder of this paper is structured as follows: Section 1 presents the related works, reviewing existing email classification techniques. Section 2 briefly explains horse herd optimization. Section 3 discusses the proposed methodology, including data pre-processing, feature extraction, and model design. Section 4 provides experimental results and performance evaluations. Finally, Section 5 concludes the paper with future research directions.
1 Related Works
Unsolicited spam emails sent by marketers to promote their products are often considered bothersome because they occupy significant server space. Scammers use such emails to obtain users' bank account information in order to steal funds. Attackers may also use spam emails to distribute viruses and other malicious software, often concealing them behind enticing offer links[7]. It is therefore imperative to tackle the issue of spam emails promptly and to implement effective countermeasures. Numerous researchers have addressed the challenge of email spam detection, and several noteworthy approaches have been documented in the literature. This section reviews previous studies that identify and categorize spam using machine learning methods and deep learning algorithms. Naive Bayes is a commonly used algorithm for this purpose, and various other techniques have also been introduced; our study, however, primarily explores metaheuristic optimization algorithms. Bibi's[8] study offered a comprehensive comparison of previous spam filtering algorithms, examining their accuracy and the datasets used. It provided detailed insights into the straightforward Naive Bayes algorithm, recognized as one of the top classification algorithms for text mining. Their evaluation of classifiers for spam detection, carried out in WEKA, showed that the Naive Bayes algorithm delivers effective accuracy and precision. Srinivasan et al.[9] introduced a spam detection method that uses word embeddings within a deep learning framework in the context of Natural Language Processing (NLP). Their study found that deep learning surpasses classical classifiers in effectiveness for spam detection.
A spam mail detection system developed by Sharma and Bhardwaj[10] used a hybrid machine learning approach that combines Naive Bayes and the J48 decision tree. The hybrid system comprises four modules: preparation of the data set, pre-processing of data, selection of features, and a hybrid bagged approach. Three experiments were conducted: the first two focused on Naive Bayes and J48 individually, while the third evaluated the proposed method. The proposed system achieved an accuracy of 87.5%.
Carreras and Màrquez[11] used a decision tree to filter unwanted emails. Because of the inherent difficulty of defining features specific to spam emails, this approach is not widely adopted in spam filtering. Harisinghaney et al.[12] employed K-Nearest Neighbors (KNN), Naive Bayes, and Reverse DBSCAN algorithms to categorize emails based on image and text, and compared the performance of these algorithms on four measurement factors. Soni[13] introduced a spam detection model called THEMIS, which simultaneously analyses emails at several levels, namely header, body, character, and word. This approach employs deep Convolutional Neural Network (CNN) algorithms to identify emails that are not legitimate.
The performance and efficiency of the proposed method are assessed against several well-established algorithms capable of both optimization and classification. These algorithms were selected from the literature and simulated, and their performance and efficiency were compared with those of the proposed approach. The results demonstrate that the proposed method surpasses existing approaches, with higher accuracy and precision as well as a shorter execution time and a lower error rate. The advantage of the new method thus lies in its accuracy and speed, together with lower error rates and lower complexity. To use HHOA for feature selection, we converted it from its original continuous form to a discrete algorithm. Additionally, recognizing that feature selection is a multi-objective problem, we adapted HHOA into a multi-objective framework and used it to select the features that indicate spam. To our knowledge, this is the first study in this domain to introduce both a binary and a multi-objective version of HHOA.
The drawbacks of state-of-the-art methods, and the ways in which BHHOA addresses them, are as follows.
1) Inefficient feature selection: Classical classifiers have drawbacks, such as using static or manually picked features containing useless or redundant information. This reduces categorization accuracy. BHHOA uses a multi-objective contrastive transformation strategy to optimize feature selection. It selects key features to improve classification accuracy and reduce computational load.
2) Low generalization and high false positives: Conventional models such as KNN and MLP may have significant false-positive rates because they generalize poorly across datasets. The proposed BHHOA solution uses a robust metaheuristic method to reduce false positives and false negatives, improving generalization.
3) Computational complexity: SVM and MLP can be costly, particularly for large datasets. BHHOA's solution reduces computational complexity by efficiently searching the solution space utilizing herd-based optimization, reducing resource consumption.
4) Limited accuracy: Existing techniques frequently rely on single-objective optimization, which can limit both feature selection and classification accuracy. A multi-objective framework optimizes several objectives together, resulting in a more balanced and effective spam detection model.
5) Inflexibility: Classical approaches are difficult to adjust to the changing nature of spam emails. BHHOA's adaptability makes it more resilient to evolving spam patterns and sophisticated attacks.
6) Limited comparative performance: Minor gains in earlier benchmarks hinder practical use.
2 Horse Herd Optimization Algorithm
In recent years, many heuristic optimization algorithms have been applied to diverse optimization problems, owing to their capacity to model and tackle real-world challenges mathematically. This research uses a recent heuristic optimization algorithm to select the features needed for spam email detection, and HHOA was chosen for this purpose. HHOA, introduced by MiarNaeimi et al.[14], is a robust heuristic optimization algorithm inspired by the herding behaviours of horses of different ages. With a multitude of control parameters derived from the behaviours of horses at various ages, HHOA performs very well on complex, high-dimensional problems. Its efficacy has been assessed on popular test functions with up to nearly 10000 dimensions, where it proved highly efficient in both exploration and exploitation. In another application, the integration of Taylor series into the horse herd optimization algorithm has facilitated clustering of subgraphs in a web page recommendation system. With its ability to identify optimal solutions swiftly at minimal cost and complexity, HHOA outperforms many established metaheuristic optimization algorithms in both accuracy and efficiency.
Horses exhibit diverse behaviours throughout their lifespans, with a typical maximum lifespan of 25 to 30 years. Horses are categorized into four groups based on age: those aged 0-5, 5-10, 10-15, and older than 15 years, denoted δ, γ, β, and α, respectively. HHOA models the social interactions of horses using six fundamental behaviours observed across different ages: grazing, hierarchy, sociability, imitation, defence mechanism, and roaming. Horse movement at each iteration is described by:
$$X_m^{iter,AGE} = V_m^{iter,AGE} + X_m^{(iter-1),AGE}, \quad AGE \in \{\alpha, \beta, \gamma, \delta\} \tag{1}$$
where X is a position vector in the search space (a potential solution to the optimization problem), m is the index of the individual (agent/solution) in the population, and V is the variation vector (velocity/mutation/update). $X_m^{iter,AGE}$ is the position of the m-th horse, AGE is the age range of the horse, and iter is the current iteration.
To determine the horses' ages, a comprehensive matrix of responses is built at each iteration and sorted from the most favourable response downwards. The top 10% of horses are assigned to category α; the subsequent 20%, 30%, and 40% form categories β, γ, and δ, respectively. The six behaviours listed above are then simulated mathematically to determine the velocity vectors.
(2)
(3)
(4)
(5)
where G, D, H, S, and R represent scalar multiplications; m denotes the m-th agent (horse/particle) in the population; a (attraction vector) represents the direction of movement towards better solutions or leaders; e (escape vector) represents the direction used to avoid danger or predators; r (roaming/exploration vector) represents random exploratory movement; and o (social/orientation vector) represents alignment with, or influence from, neighbouring agents.
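The rank-based grouping described above (a 10% tier followed by 20%, 30%, and 40% tiers) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `split_by_rank`, the convention that a lower fitness value is better, and the reading of the percentages as shares of the whole herd are all assumptions. The tier labels are passed in best-first, so the sketch does not depend on which age symbol is attached to which tier.

```python
def split_by_rank(fitness, labels, shares=(0.10, 0.20, 0.30, 0.40)):
    """Label every horse by its rank in the sorted response list.

    fitness -- one value per horse; lower is assumed to be better
    labels  -- tier names, best tier first (hypothetical argument)
    shares  -- fraction of the herd in each tier, as quoted in the text
    """
    n = len(fitness)
    # Indices of the horses, best response first.
    order = sorted(range(n), key=lambda m: fitness[m])
    groups = [None] * n
    start, cumulative = 0, 0.0
    for k, (label, share) in enumerate(zip(labels, shares)):
        cumulative += share
        # The last tier absorbs any remainder caused by rounding.
        end = n if k == len(shares) - 1 else round(cumulative * n)
        for m in order[start:end]:
            groups[m] = label
        start = end
    return groups


# Ten horses with illustrative fitness values.
herd = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
tiers = split_by_rank(herd, ["T1", "T2", "T3", "T4"])
```

With ten horses the tiers contain one, two, three, and four horses, and the horse with the lowest fitness value lands in the first tier.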
The behaviours mentioned above and their implementation are described as follows.
2.1 Grazing
Horses are grazing animals that feed on grasses and other plants, typically pasturing for 16 to 20 h per day. This slow and persistent eating habit is characteristic of their behaviour. In HHOA, grazing is represented mathematically by a coefficient that defines the grazing space around each individual horse. This coefficient specifies the area in which the horse is actively grazing, simulating its foraging activity within a specific space.
(6)
(7)
where the grazing term is the motion parameter of the i-th horse, indicating its tendency to graze. The pasturing area is limited by an upper bound and a lower bound, and ρ is a value in the range [0, 1].
2.2 Hierarchy
Horses typically rely on a leader for guidance, which can be an adult stallion or a mare, in accordance with the hierarchy principle. In a horse herd, the most experienced and strongest individual often assumes the leadership role, with the others following. Horses aged 5 to 15 years (categories β and γ) have been observed to adhere to the hierarchy principle and follow the lead of the dominant horse. This behaviour is described as follows:
(8)
(9)
Here, the hierarchy term expresses the influence of the leader horse's position on the velocity, in terms of the position of that leader horse.
2.3 Sociability
Sociability is a key behavioural trait observed in horses and a source of inspiration for HHOA. Horses naturally seek social engagement and often coexist harmoniously with other animals, which enhances their chances of survival. Some horses even show a preference for the company of other species, such as cattle and sheep. This sociable behaviour is particularly prominent in horses aged 5 to 15 years. In HHOA, sociability is reflected in the movement of horses towards the positions of other herd members, facilitating socialization within the group. This behaviour is modelled mathematically as follows:
(10)
(11)
where N is the total number of individuals in the population. The equations give the social motion vector of the m-th horse and the group cohesion at the iter-th iteration.
2.4 Imitation
Horses tend to learn both good and bad habits from one another through imitation, a behavioural trait that HHOA also models. Young horses in particular are inclined to mimic the actions of their peers, and this imitation behaviour remains prevalent throughout their lives. Imitation is described by the following equations:
(12)
(13)
where p ∈ (0,1] is a proportion of the population size N and f is the behavioural learning rate. The imitation vector moves the m-th horse towards the mean position of the best horses, and pN denotes the number of horses with favourable positions.
2.5 Defense
Horses rely on a "fight-or-flight" response to defend themselves, typically opting to flee when confronted with danger. When cornered, however, they may resort to bucking as a defensive tactic. Horses also engage in confrontations to assert dominance over resources such as food and water, and to fend off predators such as wolves. This defensive behaviour is mirrored in HHOA, where horses avoid non-optimal responses by moving away from them. The defence mechanism is given by the following equations:
(14)
(15)
where the escape vector moves the m-th horse away from the average position of the horses with the worst positions, pN is the number of horses with the worst positions, and ωξ is the reduction parameter per cycle.
2.6 Roam
The last kind of horse behavior simulated by HHOA is their tendency to roam. In their natural habitat, horses roam and graze, moving from one place to another if they are not confined to stables. They may swiftly change their grazing location and exhibit curiosity by exploring various pastures to familiarize themselves with their surroundings. In HHOA, the behavior of roaming is represented as a random movement of horses within the herd. This behavior can be depicted as follows:
(16)
(17)
Here, the roaming term specifies the velocity of the m-th horse when it moves randomly for a local search, and a reduction parameter is applied to it in each cycle. The velocities obtained for the different age groups are expressed as follows.
For the δ horses (aged 0-5 years), the velocity is given as follows:
(18)
For the γ horses (aged 5-10 years), the velocity is given as follows:
(19)
For the β horses (aged 10-15 years), the velocity is given as follows:
(20)
For the α horses (older than 15 years), the velocity is given as follows:
(21)
Adult horses in category α search locally around the global optimum with exceptional precision. Horses in category β explore the neighbourhood of the adult α horses, aiming to draw closer to them. Horses in category γ show somewhat less interest in reaching the α horses and instead have a strong inclination to explore other regions and uncover additional globally optimal locations. Owing to their distinct behavioural characteristics, the young horses in category δ are well suited to the random search phase.
3 The Proposed Method
The HHOA metaheuristic is first modified, and the adapted HHOA is then used to select features for spam email detection. The continuous HHOA is converted to a binary form to suit the discrete nature of the feature selection problem. The inputs of the resulting algorithm are then transformed into contrastive inputs. Next, the binary contrastive HHOA is extended to support multi-objective optimization, so that it can solve multi-objective problems. Finally, this multi-objective contrastive binary HHOA is employed for spam detection. Users often receive spam emails from unknown senders with unusual email addresses, so it is crucial to employ suitable methods for distinguishing spam from legitimate emails that contain important information. Each email received from the web server passes through a series of steps to determine whether it is spam. The first step is feature extraction, in which a collection of both general and specific features is extracted from the email content. The next step is feature selection, which identifies relevant features while eliminating irrelevant and duplicate ones. The final step is classification, in which emails are categorized as either spam or non-spam. The complete framework is illustrated in Fig.1 and Fig.2, which depict the flow of steps in the proposed approach and how it identifies spam emails. The following sections describe each step involved in adapting the HHOA.
Fig.1 Framework of the proposed method
Fig.2 Flowchart of the proposed approach
3.1 Binary HHOA
The optimization process differs notably between binary and continuous search spaces. In a continuous search space, horse search agents adjust their position vector by adding a step length. In a binary search space this is not applicable, because the position vector of a search agent can only hold the values 0 or 1. It was therefore necessary to devise a binary adaptation of HHOA tailored to feature selection, which is inherently a discrete problem [15]. The binary version of HHOA can be obtained in a simple and straightforward way. The lower and upper bounds of the variables are set to zero and one, and the algorithm is executed as usual. Just before the values are passed to the cost function, the greatest integer function is applied to round them to a zero-one vector. The variables themselves remain continuous; they are treated as binary only at the point where they enter the cost function. In essence, the algorithm views the problem as continuous, while the cost function treats it as discrete, and a mapping function links the continuous algorithm to the discrete (binary) cost function. This mapping is the greatest integer function: for a real value x lying between two consecutive integers m and n, it returns an integer k. This approach effectively addresses the challenge of adapting a continuous algorithm to discrete problems [16].
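The separation between a continuous optimizer and a discrete cost function can be sketched as a thin wrapper. The paper does not give the exact rounding rule. A bare floor on [0, 1] would map every value below 1 to 0, so this sketch assumes round-to-nearest implemented as floor(x + 0.5). The names `to_binary` and `binary_cost` are hypothetical.

```python
import math


def to_binary(position):
    """Round a continuous position in [0, 1] to a zero-one vector.

    Assumption: round-to-nearest through the greatest integer function,
    floor(x + 0.5), clipped so that out-of-range inputs still give 0 or 1.
    """
    return [min(1, max(0, math.floor(x + 0.5))) for x in position]


def binary_cost(position, cost_on_mask):
    """Evaluate a discrete cost function on the rounded copy only.

    The optimizer keeps working on the continuous `position`;
    only the cost function sees the 0/1 mask.
    """
    return cost_on_mask(to_binary(position))


# Toy usage: the cost is simply the number of selected features.
mask = to_binary([0.2, 0.5, 0.49, 0.99, 0.0, 1.0])
n_selected = binary_cost([0.2, 0.8, 0.6], sum)
```

The continuous position is never overwritten, so the velocity updates of the underlying algorithm stay unchanged.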
3.2 Contrastive Binary HHOA
By examining opposite solutions, contrastive learning increases the likelihood of starting with a superior initial population. The approach applies not only to initial solutions but can also be applied continuously to any solution in the current population. Contrastive learning is typically integrated into metaheuristic approaches to improve convergence [17]. As the time complexity of metaheuristic algorithms grows, contrastive learning helps to mitigate this constraint. In this strategy, the metaheuristic method also searches for optimal solutions in the direction opposite to the current solution. It then evaluates both candidates and keeps the better of the current and opposite solutions. This accelerates convergence, moving the solution nearer to the optimum [18].
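The evaluate-both-and-keep-the-better step can be illustrated as follows. The paper does not define how the opposite solution is formed. This sketch assumes the mirror point lower + upper - x used in what is commonly called opposition-based learning, which for a 0/1 vector reduces to flipping every bit. The function names are hypothetical.

```python
def opposite(position, lower=0.0, upper=1.0):
    """Mirror each coordinate inside [lower, upper] (assumed definition)."""
    return [lower + upper - x for x in position]


def keep_better(position, cost, lower=0.0, upper=1.0):
    """Return whichever of the solution and its opposite has lower cost."""
    mirrored = opposite(position, lower, upper)
    return position if cost(position) <= cost(mirrored) else mirrored


# Toy usage: the cost counts selected features, so fewer is better here.
candidate = [1, 1, 1, 0]
chosen = keep_better(candidate, cost=sum)
```

In this toy case the opposite vector selects one feature instead of three, so it replaces the original candidate. Each application costs one extra evaluation of the cost function.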
3.3 Multi-Objective Contrastive Binary HHOA
Optimization models that solve problems with a single objective function are referred to as single-objective models. In such problems, the goal is to identify the optimal solution from a set of available alternatives. In practical design and engineering scenarios, however, many problems involve multiple objective functions[19]. These are known as multi-objective optimization problems. Spam detection is a multi-objective problem with two primary objectives: reducing the number of features while maximizing classification accuracy. Higher classification accuracy ensures that most emails are correctly categorized, with minimal classification errors. Because the features selected by the modified HHOA significantly influence classification, keeping the feature count low is crucial to limiting complexity. Since multiple objective functions are involved, a multi-objective optimization method is necessary. Such methods offer engineers and system designers multiple solutions that balance the various objectives[20].
The primary distinction between single-objective and multi-objective HHOA lies in how they update objectives. In a single-objective search space, selecting the best solution is straightforward. In multi-objective HHOA, by contrast, the objective must be chosen from a set of optimal solutions. These optimal solutions are preserved, and one of them ultimately serves as the objective. The challenge lies in choosing an objective that improves the distribution of the stored solutions. To achieve this, the number of neighbouring solutions in the vicinity of each existing solution is computed first.
Feature selection in spam detection is a multi-objective optimization problem. It entails balancing two conflicting objectives: 1) reducing the number of selected features and 2) increasing classification accuracy. Consequently, defining the objective function for feature selection requires a classification algorithm.
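One common way to keep a set of optimal solutions for two conflicting objectives is Pareto dominance. The paper does not state which archive rule it uses, so the following is only a sketch under that assumption. Each candidate is a hypothetical pair (number of selected features, classification error), and both values are minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on one.

    Both objectives are minimised (feature count, classification error).
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Keep the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative (n_features, error) pairs, not results from the paper.
candidates = [(10, 0.05), (20, 0.03), (15, 0.06), (30, 0.03), (5, 0.10)]
front = non_dominated(candidates)
```

Here (15, 0.06) is dominated by (10, 0.05), and (30, 0.03) by (20, 0.03), so three trade-off solutions remain. Counting the neighbours of each remaining point, as the text describes, would then guide which of them is chosen.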
3.4 Detection of Spam Using Multi-Objective Contrastive Binary HHOA
Feature selection comprises four steps: generating feature subsets, evaluating these subsets, checking the stopping criteria, and validating the results. First, a feature subset is generated from the dataset, with candidate features identified by the search strategy of the multi-objective contrastive binary HHOA [21]. The candidate subset is then evaluated and compared with the best value of the evaluation function found so far. If a superior subset is discovered, it replaces the previous best subset. After each iteration, the fitness function computes the classifier's accuracy for the candidate subset. This cycle of candidate generation, fitness calculation, and evaluation continues until the termination criterion of the multi-objective contrastive binary HHOA is met. The stopping criteria are typically determined by two factors: the error rate and the number of iterations. If the error rate falls below a certain value or the algorithm exceeds the given number of iterations, the algorithm halts[22-23].
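The generate, evaluate, and check-stopping cycle can be sketched as a small loop. This is not the HHOA search itself: candidate generation is replaced by a single random bit flip as a stand-in, and the evaluator `toy_error` is a hypothetical function that measures disagreement with a made-up reference mask. Only the loop structure and the two stopping criteria follow the text.

```python
import random

# Hypothetical reference mask used only by the toy evaluator below.
USEFUL = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_error(mask):
    """Stand-in for a classifier's error rate on a feature subset."""
    return sum(a != b for a, b in zip(mask, USEFUL)) / len(USEFUL)


def search_subset(evaluate_error, n_features, max_iter=200,
                  target_error=0.05, seed=0):
    """Generate, evaluate, and keep the best subset until a criterion fires."""
    rng = random.Random(seed)
    best_mask = [rng.randint(0, 1) for _ in range(n_features)]
    best_error = evaluate_error(best_mask)
    iterations = 0
    # Stop when the error is low enough or the iteration budget is spent.
    while iterations < max_iter and best_error > target_error:
        candidate = best_mask[:]
        flip = rng.randrange(n_features)
        candidate[flip] = 1 - candidate[flip]  # stand-in for an HHOA move
        error = evaluate_error(candidate)
        if error < best_error:  # a superior subset replaces the best one
            best_mask, best_error = candidate, error
        iterations += 1
    return best_mask, best_error, iterations


mask, err, iters = search_subset(toy_error, n_features=8,
                                 max_iter=500, target_error=0.0)
```

In a real pipeline the evaluator would train and score a classifier on the selected columns, which is by far the most expensive part of each iteration.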
3.5 Feature Selection Using HHOA
Feature selection is an essential preprocessing phase in machine learning that identifies a subset of pertinent features to improve model accuracy and reduce computational complexity. In email categorization, proficient feature selection can enhance spam detection by eliminating redundant and irrelevant information. The HHOA, derived from the social dynamics of horse herds, has been modified into a binary variant for feature selection. In contrast to conventional algorithms, HHOA sustains a dynamic equilibrium between exploration (seeking new solutions) and exploitation (enhancing existing solutions) [24].
3.5.1 Binary conversion of HHOA
The original HHOA, being continuous, was converted to a binary format with a sigmoid-based transfer function. This function transforms continuous values into probabilities indicating the selection status of a feature. The feature selection procedure can be articulated as follows:
$$S(x) = \frac{1}{1 + e^{-x}} \tag{22}$$
where S(x) is the sigmoid function value, x is the continuous position of the horse in the search space, and e is Euler's number (e ≈ 2.71828). A threshold is applied to the sigmoid output to determine the selection of a feature:
$$X_i = \begin{cases} 1, & S(x_i) > r \\ 0, & \text{otherwise} \end{cases} \tag{23}$$
where r is a random number between 0 and 1; 1 indicates that the feature is selected, while 0 indicates that it is not; and i denotes the iteration (i = 1, 2, ..., n).
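A sigmoid transfer step of this kind can be sketched as follows, assuming the usual logistic form S(x) = 1/(1 + e^(-x)) and that a feature is selected when S(x) exceeds a fresh random number r. The function names are hypothetical, and the two-branch sigmoid is only a numerical safeguard against overflow for large negative inputs.

```python
import math
import random


def sigmoid(x):
    """Logistic function, written so that exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def binarize(position, rng):
    """Select feature j when S(x_j) exceeds a random number in [0, 1)."""
    return [1 if sigmoid(x) > rng.random() else 0 for x in position]


# Strongly positive coordinates are almost always selected,
# strongly negative ones almost never.
rng = random.Random(1)
mask = binarize([50.0, -50.0, 50.0], rng)
```

Coordinates near zero give S(x) close to 0.5, so those features are selected roughly half of the time, which keeps some randomness in the search.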
3.5.2 Multi-objective optimization
The proposed HHOA follows a multi-objective optimization approach that simultaneously optimizes two key objectives: 1) minimization of classification error, which ensures the algorithm maintains high accuracy; and 2) minimization of the number of selected features, which reduces model complexity by retaining only the most informative features. The fitness function is defined as:
$$F = \alpha E + (1 - \alpha)\frac{|S|}{|T|} \tag{24}$$
where F is the fitness function value, E is the classification error of a base classifier (e.g., SVM, KNN), |S| is the number of selected features, |T| is the total number of features, and α is a balancing parameter between error and feature count. If α is closer to 1, the algorithm prioritizes minimizing classification error. If α is closer to 0, it focuses on minimizing the number of selected features. This weighted fitness function accounts for both accuracy and feature reduction.
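A short worked example may make the role of α concrete. It assumes the weighted-sum form F = αE + (1 - α)|S|/|T| that the symbol definitions suggest. The default α = 0.9 and the sample numbers are illustrative choices, not values reported in the paper.

```python
def fitness(error, n_selected, n_total, alpha=0.9):
    """Weighted sum of classification error and selected-feature ratio.

    alpha near 1 emphasises the error; alpha near 0 emphasises sparsity.
    Lower values are better.
    """
    return alpha * error + (1.0 - alpha) * (n_selected / n_total)


# 5% error with 20 of 100 features: 0.9 * 0.05 + 0.1 * 0.20 = 0.065
f_small = fitness(0.05, 20, 100)
# Same error with twice as many features: 0.9 * 0.05 + 0.1 * 0.40 = 0.085
f_large = fitness(0.05, 40, 100)
```

With the error held fixed, the subset with fewer features gets the lower (better) fitness. With α = 1 the feature term vanishes and F reduces to the classification error.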
3.5.3 Exploration and exploitation
HHOA conducts exploration through dominant (leading) horse behaviour and exploitation through herd-following behaviour. Together, these two behaviours allow the algorithm to identify appropriate feature subsets effectively: 1) leading horse behaviour promotes global search to explore new areas of the solution space; and 2) following horse behaviour promotes local search by refining promising solutions. The position update for horses is given by:
(25)
where Xi(t) is the current position of horse i, Xl is the position of the leading horse, Xb is the best position achieved so far, and C and R are random coefficients that maintain diversity.
The proposed HHOA method uses a termination criterion to ensure efficient convergence. The algorithm terminates when any of the following conditions is met: a predetermined maximum number of iterations is reached; the improvement in the fitness function becomes trivial, dropping below a designated threshold; or the algorithm converges to a solution for which further improvements are no longer substantial. This adaptive termination technique eliminates superfluous computation, giving prompt and efficient performance while preserving the precision of the classification outcomes. The binary HHOA facilitates efficient feature selection by reducing the feature count while maintaining classification accuracy. This not only reduces computing expense but also improves the model's interpretability. The efficacy of HHOA, as demonstrated by the experimental findings, confirms its suitability for email spam detection.
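The termination conditions listed above can be sketched as a single check on the history of best fitness values. The paper does not give the threshold or the window over which improvement is measured, so `tol`, `patience`, and the function name are hypothetical parameters chosen for illustration. Fitness is assumed to be minimized.

```python
def should_stop(history, max_iter=100, tol=1e-4, patience=10):
    """Decide whether the search should terminate.

    history  -- best fitness after each iteration, most recent last
    max_iter -- iteration budget
    tol      -- smallest improvement still counted as progress (assumed)
    patience -- number of recent iterations the improvement is
                measured over (assumed)
    """
    if len(history) >= max_iter:
        return True  # iteration budget exhausted
    if len(history) > patience:
        improvement = history[-patience - 1] - history[-1]
        return improvement < tol  # progress has become trivial
    return False


stalled = [0.5, 0.4, 0.3] + [0.3] * 10
improving = [1.0 - 0.01 * i for i in range(20)]
```

Here `should_stop(stalled)` is true because the last ten iterations brought no gain, while `should_stop(improving)` is false because the fitness is still falling by 0.01 per iteration.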
4 Results Analysis
The experiments were performed on a Windows 11 machine with an Intel Core i7 processor, 16 GB of RAM, and an NVIDIA RTX 3060 GPU. The Binary Horse Herd Optimization Algorithm (BHHOA) was implemented in Python 3.10 using the NumPy, Pandas, Scikit-Learn, and Matplotlib modules. A binary variant of HHOA was created and combined with conventional classifiers, including KNN, MLP, SVM, and Naive Bayesian, for comparative evaluation. The algorithm was validated through comprehensive testing and parameter optimization. Performance was evaluated using accuracy, precision, recall, F1-score, False Positive Rate (FPR), and False Negative Rate (FNR) to assess classification accuracy, error mitigation, and robustness. The study examined research questions regarding BHHOA's efficacy, feature selection ability, error reduction, and computational efficiency. The Kaggle email classification dataset [25-26], consisting of 57000 emails with a spam-to-non-spam ratio of 30%-70%, was used for assessment. Following TF-IDF preprocessing for text representation, the dataset was divided into 80% for training and 20% for testing. Five-fold cross-validation was used to obtain dependable and unbiased results. Table1 and Figs.3-4 show how BHHOA outperformed existing methods.
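For reference, the six evaluation measures named above follow directly from the four confusion-matrix counts, with spam treated as the positive class. The sketch below uses the standard definitions; the counts in the usage example are illustrative and are not taken from the paper's experiments. The function name is hypothetical, and no guard against empty classes is included.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion counts.

    tp/fn -- spam emails classified as spam / as non-spam
    tn/fp -- non-spam emails classified as non-spam / as spam
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),  # legitimate mail wrongly flagged
        "fnr": fn / (fn + tp),  # spam that slipped through
    }


# Illustrative counts only.
scores = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```

With these counts every ratio is simple: accuracy, precision, recall, and F1 are all 0.8, while FPR and FNR are both 0.2. FNR is always 1 minus recall.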
Table 1 Comparison of results of the proposed methodology with traditional methods
Fig. 3 Bar graph showing accuracy comparison with traditional methods
Fig. 4 Bar graph showing F1-score comparison with traditional methods
The Receiver Operating Characteristic (ROC) [26] curve illustrates the trade-off between the true positive rate (TPR) and the false positive rate (FPR). Area Under the Curve (AUC) scores approaching 1 signify superior performance, as shown in Fig. 5. This demonstrates BHHOA’s enhanced classification capability.
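To make the AUC interpretation concrete, the sketch below computes it as the probability that a randomly chosen spam email receives a higher score than a randomly chosen legitimate one, which is equivalent to the area under the ROC curve. The function name `roc_auc` and the example data are hypothetical; the pairwise loop is quadratic in the number of examples and is intended for illustration rather than for large datasets.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = spam) and classifier scores, for illustration only.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to random ranking, while 1.0 means every spam email is scored above every legitimate email.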
4.1 Computational Efficiency
BHHOA shows competitive training and prediction times, as shown in Table 2 and Fig. 6, making it a suitable choice for real-time applications.
Fig. 5 Line graph showing AUC scores
Table 2 Comparison of training time and prediction time
Fig. 6 Bar graph showing computational efficiency
4.2 Error Rate Comparison
BHHOA demonstrates the lowest FPR and FNR, as shown in Table 3 and Fig. 7, further supporting its reliability. Fig. 8 shows the convergence curve for HHOA. The fitness value gradually declines after 50 iterations, signifying successful convergence. The slight variations indicate that the algorithm continues to explore the search space while refining the solution.
Table 3 Comparison of error rates
4.3 Statistical Analysis with Chi-Square Test Results
A chi-square test [27] was performed to assess the statistical significance of the classification performance of the proposed HHOA model. The test yielded a chi-square statistic of 9070.30 with a p-value reported as 0 (that is, extremely small), signifying a substantial disparity between the observed and expected values. With one degree of freedom, the test indicates that the model’s classification accuracy is not attributable to random chance. The low rates of false positives and false negatives further support the model’s reliability. The results are shown in Table 4.
Fig. 7 Bar graph showing error rate comparison
Fig. 8 Convergence graph of HHOA
This statistical validation reinforces the assertion that the proposed HHOA algorithm proficiently distinguishes between spam and non-spam emails, surpassing traditional methods.
Table 4 Chi-square test results
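A chi-square test with one degree of freedom, as reported in Section 4.3, corresponds to a 2x2 contingency table. The sketch below shows one way such a test can be computed, assuming the table cross-tabulates actual class (rows) against predicted class (columns); the paper does not specify how its table was constructed, so this layout, the function name `chi_square_2x2`, and the example counts are assumptions. It uses the Pearson statistic without continuity correction and assumes every row and column total is nonzero.

```python
import math
from typing import Sequence, Tuple


def chi_square_2x2(table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Pearson chi-square test of independence for a 2x2 table (df = 1)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if actual and predicted labels were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For one degree of freedom, the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for illustration only; not the paper's data.
chi2_stat, p_val = chi_square_2x2([[10, 20], [20, 10]])
```

For very large statistics, the p-value computed this way underflows to exactly 0.0 in double-precision arithmetic, so a reported p-value of 0 is best read as "smaller than the machine can represent" rather than literally zero.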
5 Conclusions and Future Work
Unsolicited emails, commonly known as spam, pose a significant challenge for both internet users and data centers. They consume substantial storage and resources while also serving as a gateway for intrusions, cyber-attacks, and unauthorized access to user information. The objective of this research was to use a robust metaheuristic optimization algorithm to identify spam emails in email services. To achieve this goal, the paper used the horse herd optimization algorithm, a recently developed nature-inspired metaheuristic optimization approach designed to address highly complex optimization problems. This paper outlines the construction of a spam email detection model that combines classification with optimization. A comparison of spam detection with and without optimization shows that optimization significantly enhances accuracy.
5.1 Main Challenges of HHOA
The suggested HHOA algorithm encounters multiple obstacles that may affect its performance. It demonstrates parameter sensitivity, requiring meticulous adjustment to attain optimal outcomes. Moreover, although it is computationally efficient relative to some models, handling extensive datasets may result in prolonged computation time. A further disadvantage is the potential for redundancy, in which irrelevant or superfluous features may be selected despite the optimization procedure. Moreover, the algorithm may face generalization challenges, as its efficacy can fluctuate when applied to datasets with markedly distinct spam characteristics. Mitigating these problems could further augment the algorithm’s resilience and utility.
5.2 Limitations of the Study
The suggested HHOA algorithm has some drawbacks that may hinder its wider use. Its reliance on the Kaggle spam dataset may restrict its applicability to other datasets with varying attributes. The algorithm is also sensitive to initial parameter configurations, which can affect its performance. Evaluation on multilingual spam datasets is needed to assess the algorithm’s efficacy across various languages. Furthermore, when applied to exceptionally large datasets, the technique may encounter computational overhead, affecting efficiency. Addressing these limitations can improve the algorithm’s resilience and scalability.