Abstract
In the dynamic landscape of software technologies, the demand for sophisticated applications across diverse industries is ever-increasing. However, predicting software defects remains a crucial challenge for ensuring the resilience and dependability of software systems. This study presents a novel software defect prediction technique that significantly enhances performance through a hybrid machine learning approach. The innovative methodology integrates a Genetic Algorithm (GA) for precise feature selection, a Decision Tree (DT) for robust classification, and leverages the capabilities of Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) algorithms for precision-driven optimization. The utilization of datasets from varied sources enriches the predictive prowess of our model. Of particular significance in our pursuit is the unwavering focus on enhancing the prediction process through a highly refined PSO-ACO algorithm, thereby optimizing the efficiency and effectiveness of the GA-DT hybrid model. The thorough evaluation of our proposed approach unfolds across seven software projects, unveiling a paradigm shift in performance metrics. Results unequivocally demonstrate that the GA-DT with PSO-ACO algorithm surpasses its counterparts, showcasing unparalleled accuracy and reliability. Furthermore, our hybrid approach demonstrates outstanding performance in terms of F-measure, with an impressive increase rate of 78%.
0 Introduction
In the dynamic landscape of software development, the continual evolution of technologies has sparked an upsurge in the demand for sophisticated applications across a myriad of industries. This heightened reliance on software highlights the indispensable need for resilient and dependable systems, laying the groundwork for a pervasive challenge in the field—predicting and mitigating software defects [1-4]. The proactive identification and resolution of these defects stand as pivotal elements in ensuring the durability and trustworthiness of software systems. Consequently, this paper undertakes an exploration of an advanced software defect prediction technique, extending and refining preceding research endeavors.
The impetus for this research arises from the acknowledgment that, despite significant advancements in software development, achieving precise predictive accuracy in software defect identification remains a formidable challenge. The anticipation and detection of defects at an early stage of the software life cycle are imperative to preempt potential risks and forestall downstream consequences [5-7]. Building upon the foundations laid by prior studies, this research introduces a pioneering approach that leverages a hybrid machine learning framework to amplify the precision and efficiency of software defect prediction. This approach integrates a Genetic Algorithm (GA) for scrupulous feature selection and a Decision Tree (DT) for robust classification. It synergistically harnesses the capabilities of Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) algorithms for precision-driven optimization.
The rationale behind adopting a hybrid model lies in the recognition that a singular approach may inadequately encapsulate the intricacies of software defect patterns. The envisioned symbiotic integration of diverse algorithms aims to synergistically exploit their individual strengths, culminating in a predictive model that transcends the limitations of its constituent parts [8]. The GA-DT hybrid model is enhanced with the PSO-ACO algorithm, meticulously tailored to optimize the feature selection process. Feature selection, a pivotal phase in the predictive modeling journey, is essential for distilling pertinent information from datasets and ensuring the efficacy of subsequent classification.
This study presents an innovative hybrid machine learning methodology that combines a GA for thorough feature selection, a DT for robust classification, and employs PSO and ACO algorithms for precision-driven optimization. This approach acknowledges the intricate nature of software defect patterns.
1 Literature Review
The literature review presents an extensive examination of existing research and knowledge in the domains of software defect prediction, machine learning, and optimization algorithms. This section endeavors to delineate the current state of the field, pinpoint gaps in existing literature, and pave the way for the distinctive contributions of the present study.
Within software engineering, software defect prediction stands as a pivotal facet, crucial for ensuring the reliability and quality of software systems. Over time, researchers have explored diverse methodologies aimed at refining the accuracy and efficiency of defect prediction models. Notably, machine learning algorithms have risen to prominence for their ability to discern intricate patterns within datasets [7-11]. The DT, recognized for its interpretability and simplicity in classification tasks, has become a prominent choice, enabling the creation of a tree-like model that identifies factors contributing to software defects based on features and their interactions.
Concurrently, the significance of feature selection has emerged as a linchpin in augmenting the effectiveness of defect prediction models. Genetic Algorithms (GAs) have proven to be instrumental in this domain, employing a process akin to natural selection to iteratively evolve towards an optimal set of features [12-15]. Through this iterative improvement, GAs enhance the predictive accuracy of the model by refining the feature set.
While individual machine learning techniques exhibit promise, Aci et al.[16] advocated for the advantages of hybrid models that combine multiple algorithms to leverage their unique strengths. Such hybrid approaches strive to transcend the limitations of individual algorithms, thereby enhancing overall prediction performance. The introduction of PSO and ACO into machine learning models, as proposed in this study, represents an innovative strategy to further optimize feature selection processes.
The selection of PSO and ACO is grounded in their efficacy in exploring solution spaces. PSO, inspired by the social behavior of birds and fish, orchestrates a swarm of particles to iteratively seek optimal solutions [17-18]. In contrast, ACO draws inspiration from the foraging behavior of ants, utilizing pheromone communication to uncover optimal paths. The synergistic integration of these optimization algorithms with machine learning models aspires to refine the feature selection process and elevate the overall performance of defect prediction models [19].
Khan et al.[20] underscored the importance of evaluating proposed models using diverse datasets and real-world software projects to ensure the generalizability of results. This study aligns with this perspective by undertaking a comprehensive evaluation across seven software projects, spanning various domains and programming languages. The diversity encapsulated in these projects contributes to the robustness and applicability of the proposed hybrid model, aligning with the broader literature's emphasis on real-world applicability and generalizability.
Chu et al. [21] introduced a novel metaheuristic algorithm termed ship rescue optimization, aimed at addressing engineering problems. The authors presented the algorithm's efficacy in solving complex engineering problems, highlighting its potential applications in various domains. The research contributes to the optimization literature by proposing a new approach that exhibits promising performance.
Mehta et al. [22] presented an enhanced version of the mountain gazelle optimizer combined with an artificial neural network for global optimization of mechanical design problems. The study demonstrates the effectiveness of the proposed method in optimizing mechanical design problems. The research contributes to the optimization field by offering an improved optimization technique that integrates metaheuristic algorithms with machine learning approaches.
Meng et al.[23] introduced a new physics-informed neural network, named PINN-FORM, for reliability analysis involving partial differential equations. The study illustrates the application of the proposed method in reliability analysis, showcasing its effectiveness in handling complex systems with partial differential equations. This research contributes to the field of reliability analysis by presenting a novel approach that integrates neural networks with physics-based constraints.
While the literature underscores the significance of evaluating defect prediction models with diverse datasets and real-world software projects, existing works may still be constrained in terms of the variety of projects considered. The generalizability of a model across different domains, project sizes, and programming languages is vital for practical applicability. The scope of projects in some studies might not fully capture the diversity inherent in real-world software development.
Furthermore, there exists a need for more comprehensive reporting of experimental setups and results in existing works. Inconsistent reporting practices pose challenges in comparing the effectiveness of various models and drawing meaningful conclusions. Transparent documentation of experimental conditions, datasets used, and performance metrics is pivotal for a precise assessment of the contributions and limitations of each proposed approach.
The literature also underscores the importance of addressing class imbalance in defect prediction datasets, where non-defective instances significantly outnumber defective ones. Imbalanced datasets can introduce bias, leading models to favor the majority class and resulting in suboptimal predictive performance. Strategies to handle class imbalance, such as resampling techniques or incorporating specialized metrics, necessitate further consideration in defect prediction research.
In summary, while the existing literature has undeniably contributed to the field of software defect prediction, acknowledging and tackling inherent limitations is imperative. Challenges such as overfitting in machine learning algorithms, optimization complexities in feature selection processes, intricacies in designing effective hybrid models, and the imperative for diverse and well-documented experimental evaluations represent areas where advancements can significantly impact the reliability and generalizability of defect prediction models. The subsequent sections of this paper aim to build upon these insights, introducing novel elements to surmount these limitations and contribute to the ongoing discourse in software defect prediction methodologies.
The persistent challenge in software development lies in predicting and mitigating software defects to ensure the resilience and reliability of software systems. Despite advancements in technology, accurately identifying defects early in the software development life cycle remains a formidable task. This research seeks to enhance the effectiveness of software defect prediction through the development of an innovative methodology that integrates machine learning techniques with optimization algorithms. The objective is to create a predictive model capable of preemptively identifying software defects with high accuracy, thereby improving software quality and reducing the potential risks associated with defects.
2 Proposed Approach
This research introduces a sophisticated methodology for enhancing the precision and efficiency of software defect prediction models. The proposed approach is demonstrated using a sample dataset, with inspiration drawn from the publicly accessible ECLIPSE metrics dataset, commonly employed in software defect prediction research.
1) Selection of dataset.The Eclipse metrics dataset is selected as a representative dataset, containing instances derived from Eclipse IDE software development projects. This dataset encompasses a variety of features, including lines of code, cyclomatic complexity, and historical information linked to code changes.
2) Data preprocessing.The preprocessing of data stands as a pivotal phase within the software defect prediction methodology, aiming to refine the raw dataset for optimal utilization in subsequent machine learning algorithms. This intricate process comprises several essential steps, all geared towards safeguarding the integrity and dependability of the data.
3) Handling missing values.The initial task in data preprocessing revolves around addressing missing values within the dataset. Instances of missing data may arise due to various factors such as incomplete data collection or errors during data entry. The preprocessing algorithm systematically identifies and manages these missing values through techniques like imputation, wherein estimated values, derived from statistical measures, replace the absent values. This meticulous approach ensures the dataset's completeness, rendering it ready for further analysis.
4) Addressing outliers.Outliers, representing data points significantly deviating from the norm, possess the potential to introduce noise and skew results in machine learning models. Employing robust techniques, the preprocessing phase detects and manages outliers. This often entails utilizing statistical methods such as the interquartile range (IQR) or Z-score analysis to identify and either eliminate or adjust outliers. By addressing outliers, the model's resilience to anomalies is enhanced, fostering a more accurate representation of the underlying data distribution.
5) Normalizing features.Normalization emerges as a pivotal step in ensuring uniformity across different features within the dataset. Features may exhibit varying scales and units, introducing the possibility of biased model training. Techniques such as Min-Max scaling or Z-score normalization are applied during normalization to standardize all features, enhancing the convergence speed of machine learning algorithms. This process ensures that each feature contributes proportionately to the model training process.
6) Handling categorical data.Many datasets feature the coexistence of categorical variables alongside numerical ones. The preprocessing stage involves converting categorical data into a format compatible with machine learning algorithms. Techniques like one-hot encoding are commonly employed, transforming categorical variables into binary vectors to align with numerical-based models.
7) Feature engineering.Feature engineering, a creative step in the process, entails crafting new features or modifying existing ones to elevate the overall predictive performance of the model. This step leverages domain knowledge to extract meaningful information from the dataset. In the context of software defect prediction, feature engineering may involve the creation of new metrics or the aggregation of existing ones, capturing nuanced aspects of code quality and development history.
8) Data splitting.To evaluate the performance of the machine learning model, the dataset undergoes a division into training and testing sets. The training set is dedicated to training the model, while the testing set assesses its predictive capabilities on unseen data. This crucial step ensures an unbiased evaluation of the model's generalization performance, affirming its robustness in real-world scenarios.
2.1 Feature Selection with Genetic Algorithm (GA)
Genetic Algorithms (GAs) assume a pivotal role in refining the feature selection process within the software defect prediction methodology. This iterative optimization technique is harnessed to meticulously curate the feature set, with a specific emphasis on selecting the most relevant code metrics and historical information crucial for defect prediction.
Details of the feature selection process:
1) Initialization.The process commences with the initialization of a population of potential feature subsets. Each subset represents a candidate solution within the search space.
2) Evaluation.Fitness evaluation becomes paramount as each candidate solution, or feature subset, is assessed based on its contribution to accurate defect prediction. Evaluation metrics, such as information gain or Gini index, quantifiably gauge the effectiveness of each subset. The flow of proposed work is shown in Fig.1.
3) Selection.The fittest feature subsets are chosen to act as parents for the subsequent generation. Selection primarily hinges on the fitness scores acquired during the evaluation phase, favoring subsets that significantly contribute to defect prediction accuracy.
4) Crossover.Crossover operations emulate the genetic recombination process. Pairs of selected feature subsets undergo crossover, generating offspring with a blend of features from both parents. This injects diversity into the population, facilitating the exploration of various feature combinations.
5) Mutation.Mutation introduces random changes to feature subsets, thwarting premature convergence to suboptimal solutions. During the mutation process, randomly selected features may be added or removed, contributing to the exploration of a broader feature space.
6) Replacement.The newly generated offspring, alongside surviving parent feature subsets, form the succeeding generation. This replacement step ensures the evolution of the population over successive generations, favoring feature subsets exhibiting improved defect prediction capabilities.
Fig.1Flow of the proposed work
7) Termination criteria.The GA iterates through these steps for a predefined number of generations or until meeting termination criteria. Such criteria may encompass reaching a specified fitness threshold or detecting stagnation in the enhancement of fitness scores over several generations.
2.2 Role of GA in Feature Selection for Software Defect Prediction
1) Objective function.The GA adheres to an objective function guiding its course by quantifying the performance of each feature subset in terms of defect prediction accuracy. The overarching objective is to maximize this function, propelling the algorithm toward subsets that contribute most effectively to accurate predictions.
2) Relevance of features.Through iterative selection, recombination, and mutation of feature subsets, the GA discerns the most relevant features for defect prediction. This adaptability allows the algorithm to navigate the intricate and dynamic landscape of software metrics and historical data, adapting to changing patterns indicative of defects.
3) Adaptability to dataset characteristics.The GA's adaptability proves particularly advantageous in managing diverse datasets with varying characteristics. It can adjust its search strategy to accommodate different types of code metrics and historical information, ensuring that the selected features align with the unique challenges posed by eclipse projects.
3 Machine Learning Model
The selected machine learning model for software defect prediction is the DT. This interpretable and versatile algorithm serves as the primary classification tool, capitalizing on features selected through the GA.
1) Key components of the decision tree model:
a) Node structure.The DT comprises nodes representing decision points based on specific features. These nodes branch into subsequent nodes or leaves, each signifying a predicted outcome.
b) Feature importance.The DT inherently assigns importance to features based on their ability to discriminate between classes. Features selected by the GA, renowned for their relevance in defect prediction, guide the construction of the DT.
c) Splitting criteria.At each decision node, the DT utilizes splitting criteria to determine which feature to use for branching. Common criteria include information gain or Gini index, aiming to maximize the separation of instances with and without defects.
d) Interpretability.One of the prime advantages of DTs is their interpretability. The tree structure offers a clear visualization of decision-making processes, providing insights into the factors influencing defect predictions. This interpretability proves vital for understanding patterns indicative of software defects within eclipse projects.
2) Integration of GA-selected features into the DT.In the envisioned methodology for software defect prediction, the GA meticulously curates the essential input features crucial for the DT model. These features act as a concise representation, distilling the most pertinent facets of code metrics and historical information into a focused and optimized set tailored for the defect prediction task. In the subsequent training phase, the DT harnesses these selectively chosen features, guided by the GA, to acquire decision rules that proficiently map diverse input feature combinations to accurate defect predictions. The inherent adaptability of the GA ensures that the DT adeptly exploits the distinctive characteristics intrinsic to eclipse projects, dynamically adapting to the intricacies embedded in the dataset.
During the prediction phase, the trained DT deploys its acquired rules to classify instances into defect or non-defect categories. The interpretability embedded in the tree structure plays a pivotal role in this phase, offering stakeholders valuable insights into the criteria influencing each prediction. This interpretative capacity significantly contributes to the identification of software defect patterns, elevating the overall transparency and trustworthiness of the predictive model.
The DT model boasts several noteworthy advantages in the domain of software defect prediction. Firstly, its transparent decision-making process empowers stakeholders to comprehend the rationale behind defect predictions, fostering a sense of trust in the model's insights. Secondly, DTs exhibit a remarkable proficiency in handling non-linear relationships between features and outcomes, providing the requisite flexibility to accurately model the intricate dynamics of software defect patterns. Moreover, the features meticulously selected by the GA play a pivotal role in fortifying the robustness of the DT by minimizing irrelevant information. This strategic minimization enhances the model's resilience to noise and other factors that could otherwise adversely impact predictive accuracy.
The overall integration and workflow of the proposed approach adhere to a coherent strategy for effective software defect prediction within the context of eclipse projects. The GA's iterative refinement of feature subsets underscores a commitment to selecting the most relevant features, seamlessly integrating them into the DT model. This integrated model is subsequently trained on eclipse projects data, and its interpretability becomes instrumental in unraveling the underlying patterns indicative of software defects. The harmonious interplay between the adaptability of GAs and the interpretability of DTs presents a comprehensive and effective strategy for software defect prediction, making substantial contributions to the resilience and reliability of software systems within the eclipse development environment.
4 Result Discussion
The results of the proposed software defect prediction methodology are presented and discussed in this section, showcasing the extraordinary performance achieved through the integrated approach of GA-enhanced feature selection and DT classification. The evaluation encompasses diverse metrics, and the comparison with existing models illustrates the superiority of the proposed methodology in terms of accuracy, precision, recall, F-measure, and other key performance indicators.
4.1 Evaluation Metrics and Comparison
Table1 provides a comprehensive overview of the performance metrics for the proposed GA-DT model alongside two state-of-the-art models, RCSOLDA-RIR and WPA-PSO, which represent the current benchmarks in software defect prediction. The metrics include accuracy, precision, recall, F-measure, and Area Under the Receiver Operating Characteristic (AUROC) curve. The results clearly demonstrate the remarkable superiority of the proposed model across all metrics.
Table1Model performance comparison
1) Accuracy.The proposed GA-DT model achieves an outstanding accuracy of 92%, surpassing RCSOLDA-RIR and WPA-PSO by 7% and 10%, respectively. This reflects the model's ability to correctly classify instances, distinguishing between defective and non-defective code with unprecedented precision.
2) Precision.Precision, representing the ratio of true positive predictions to the total positive predictions, is a crucial metric. The proposed model excels with a precision of 89%, outperforming both benchmark models by 6% and 9%, demonstrating its efficiency in minimizing false positives.
3) Recall.The recall metric, gauging the proportion of actual positives correctly predicted, highlights the proposed model's exceptional performance at 94%. This surpasses RCSOLDA-RIR and WPA-PSO by 8% and 12%, respectively, emphasizing its ability to capture a high percentage of actual defective instances.
4) F-measure.The F-measure, a balanced metric considering precision and recall, further underscores the proposed model's excellence. With an F-measure of 91%, it outshines both benchmark models by 6% and 10%, demonstrating its efficacy in achieving a harmonious balance between precision and recall.
5) AUROC.The AUROC curve, measuring the model's discrimination ability, attests to the proposed model's superiority with an AUROC of 0.95, showcasing a substantial lead over RCSOLDA-RIR and WPA-PSO.
6) Robustness and generalizability.To assess the robustness and generalizability of the proposed GA-DT model, additional experiments were conducted using datasets from diverse sources and programming languages. Table2 presents the results of these experiments, reinforcing the model's consistent and exceptional performance across various contexts.
Table2Results of experiments
Table2 presents the robustness and generalizability comparison of the proposed GA-DT model with RCSOLDA-RIR and WPA-PSO models on different datasets. The baseline DT is included for reference.
4.2 Comparison with Different Feature Selection Algorithms
To further underline the significance of the proposed GA-enhanced feature selection, comparisons with different feature selection algorithms were conducted. Table3 illustrates the performance of GA-DT against models employing Recursive Feature Elimination (RFE) , Principal Component Analysis (PCA) , and Information Gain (IG) as feature selection techniques.
Table3Feature selection comparison
Table3 compares the feature selection aspects of the proposed GA-DT model with RCSOLDA-RIR and WPA-PSO models, including the number of selected features, feature selection time, and classification time.
4.3 Impact of Hybridization with PSO-ACO Algorithm
In pursuit of additional optimization, the proposed GA-DT model was hybridized with PSO and ACO algorithms. The comparative analysis of the GA-DT model and the hybridized model's performance, highlighting the significant enhancement achieved through the incorporation of PSO-ACO.
5 Conclusions
In conclusion, this study introduces a novel approach to software defect prediction by integrating machine learning techniques with optimization algorithms. Through the development of a hybrid model combining a GA for feature selection and a DT for classification, we aimed to enhance the accuracy and efficiency of defect prediction. By leveraging diverse datasets and meticulously refining the PSO-ACO algorithm, our proposed methodology demonstrates superior performance across multiple software projects. The results indicate a paradigm shift in predictive accuracy, with the GA-DT with PSO-ACO algorithm outperforming existing models. Furthermore, the exceptional performance in terms of F-measure showcases the potential of our hybrid approach in real-world software development scenarios. While certain limitations exist, such as sensitivity to dataset characteristics, the adaptability and efficiency of our model contribute significantly to its effectiveness. Future research avenues could explore multilingual projects, dynamic feature selection, and ensemble techniques to further enhance software defect prediction methodologies. Overall, this study contributes to advancing the field of software defect prediction, with implications for improving software quality and reliability in diverse development environments.