Abstract
In recent years, Speech Emotion Recognition (SER) has developed into an essential instrument for interpreting human emotions from auditory data. This research develops an SER system that employs deep learning and multiple datasets of emotional speech. Its primary objective is to investigate the use of Convolutional Neural Networks (CNNs) for sound feature extraction. Time stretching, pitch manipulation, and noise injection are among the augmentation techniques used to improve data quality. Feature extraction methods including Zero Crossing Rate, Chroma_stft, Mel-scale Frequency Cepstral Coefficients (MFCC), Root Mean Square (RMS), and Mel-Spectrogram transform the audio signals into features that are used to train the model. The study then presents a thorough evaluation of the model's performance. The model achieved an accuracy of 94.57% on the test dataset, and the proposed work was also validated on the EMO-BD and IEMOCAP datasets. Future development paths consist of further data augmentation, feature engineering, and hyperparameter optimization. By following these paths, SER systems can be deployed in real-world scenarios with greater accuracy and resilience.
Keywords
0 Introduction
Speech signals are acoustic signals[1] created by the human vocal tract during speech generation. As sound waves, they carry the information used to express spoken language, including words, moods, intonation, and other linguistic elements. Their waveform reflects changes in air pressure caused by the movement of the vocal cords, the articulation of the tongue, lips, and palate, and the modulation of airflow through the vocal tract; these variations in air pressure give the signals their distinctive features. Voice signals have long been analyzed primarily through signal processing, a discipline that employs a diverse range of methods to extract relevant data from the signals under analysis. Comprehending and discerning spoken language requires knowledge of pitch, intensity, frequency, and temporal patterns, all of which are inherent in the speech signal. Speech signals are the auditory manifestation of spoken words and provide insight into the physiological processes that generate them: the respiratory system, together with the articulators (mouth, lips, and palate) and the vocal cords, produces these signals by modulating airflow and air pressure. The resulting sound vibrations transmit linguistic information, and their waveform structures, frequencies, amplitudes, and temporal configurations contribute to the intricacy and depth of spoken communication. To comprehend speech, signal processing analyzes the signal and extracts voice characteristics, yielding spectral characteristics such as intensity, intonation, and content.
These attributes reflect the intensity and frequency components of the speaker's voice. The interpretation of speech signals is complicated by the modifications that cadence, pauses, and temporal variations introduce. In voice emotion recognition, emotion information is conveyed through voice signals[2]: speech conveys emotion through variations in pitch, intonation, tempo, and other acoustic components. The analysis of speech signals is therefore critical for comprehending these emotional indicators and for identifying and categorizing the emotional states expressed through spoken language. By extracting features that capture these emotional shifts, models capable of identifying and classifying speech emotions can be developed. Within the domain of Speech Emotion Recognition (SER), these emotional indicators are identified and classified using machine learning and signal processing methods, which allows the underlying emotions communicated through spoken words to be recognized. Since emotions are frequently reflected in variations in pitch, tone, strength, and other acoustic aspects of speech signals, these characteristics are essential for the study of emotions[3].
Convolutional Neural Network (CNN): A CNN is a type of deep neural network that is widely used in deep learning[4]. Fig.1 shows the working of a CNN model for classification tasks.
Fig.1 Convolutional neural network
CNNs are primarily applied to the analysis of speech signals, images, and other signals. To identify patterns in their input, whether photos, sounds, or position data, CNNs are designed to function in a manner analogous to the human nervous system. In most cases, a CNN is composed of multiple layers stacked one on top of the other. Each layer processes the data in its own way, using the output of the previous layer to refine the data further, and each layer can have its own set of parameters. To improve a CNN's performance, these layers must be arranged in the appropriate sequence and configured appropriately[5]. The paper is structured as follows: Section 1 presents the related work, and Section 2 discusses the datasets, the features extracted, and the model employed. Section 3 presents the comparative results and general comments. Section 4 concludes the paper.
1 Related Works
Krishna et al.[6] employed Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP) classifiers along with Mel-Frequency Cepstral Coefficients (MFCC), MEL, chroma, and Tonnetz audio features for emotion recognition, achieving an accuracy of 86.5%. Khalil[7] provided an overview of deep learning techniques applied in the SER literature without focusing on a specific model. Anusha et al.[8] utilized classifiers trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset with features such as MFCC, Mel-Spectrogram, and chroma for SER, while Wani et al.[9] conducted a comprehensive review of SER systems and methodologies without focusing on a single model. Arun et al.[10] explored diverse machine learning models for emotion recognition in speech, utilizing various feature sets across Indian languages and aiming to identify optimal model-feature combinations for detecting emotions, including sarcasm. In Ref.[11], various machine learning models including SVM, Long Short-Term Memory network (LSTM), random forests, and CNNs were employed for emotion classification in speech signals, with the 2D CNN model achieving the highest accuracy of around 70% on the testing dataset. In Ref.[12], MFCC, along with pitch and Short-Term Energy (STE) features, was used with an SVM classifier for emotion classification in North American English speech datasets. Kumar et al.[13] employed deep learning techniques for speech emotion recognition based on feature extraction and model creation. Ref.[14] implemented a deep learning-based system for emotion detection in speech signals, achieving an efficiency rate of 81.82%. Ref.[15] presented an emotion detection system for speech signals, validated with a dataset of 250 utterances from two Chinese female speakers. Ref.[16] introduced a Deep Neural Network (DNN) architecture for SER, achieving 96.97% accuracy on the Berlin Database of Emotional Speech (3-class subset).
Cherif et al.[17] employed machine learning-based models, CNNs, LSTM, and Bidirectional Long Short-Term Memory (BiLSTM) for speech emotion recognition in the Algerian dialect, achieving a top accuracy of 93.34% with the LSTM-CNN model on their collected dataset. Yoon et al.[18] introduced a novel deep dual recurrent encoder model that combines text and audio signals for SER, outperforming prior methods with accuracies between 68.8% and 71.8% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. Ramdinmawii et al.[19] explored emotion recognition in speech signals using signal processing methods, analyzing four basic emotions (anger, happy, fear, neutral) by extracting features such as F0, formants, dominant frequencies, ZCR, and signal energy. The study cross-validated its findings across German and Telugu emotion databases, revealing distinct differences between emotions, particularly in high-arousal states, and providing insights for diverse applications. Ref.[20] introduced an end-to-end SER system employing multi-level acoustic data and a unique co-attention module, achieving competitive results on the IEMOCAP dataset. The model extracted MFCC, spectrogram, and high-level acoustic data through CNN, BiLSTM, and wav2vec2, respectively, and fused them using the proposed co-attention mechanism. In Ref.[21], Mountzouris et al. used six deep learning networks: Deep Belief Network (DBN), DNN, LSTM, LSTM with Attention Mechanism (LSTM-ATN), CNN, and CNN with Attention Mechanism (CNN-ATN). The models were trained and evaluated on the Surrey Audio-Visual Expressed Emotion (SAVEE) and RAVDESS databases and incorporated techniques such as dropout and batch normalization for improved generalization and faster training. Results indicated that models with attention mechanisms outperformed the others, with CNN-ATN achieving the highest accuracy of 74% for SAVEE and 77% for RAVDESS, surpassing existing state-of-the-art systems for these datasets.
2 Methodology
2.1 Dataset
Fig.2 shows the number of samples per emotion in the dataset used. To train and evaluate the SER model, the study uses several diverse datasets, which together provide a rich and varied bank of emotional speech samples for analysis and categorization.
RAVDESS is the first dataset used. Its audio recordings feature a variety of actors speaking scripted statements that express a wide spectrum of emotions. Each audio sample in RAVDESS is labeled with one of eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised. Further variables, such as the statement and the emotional intensity, are encoded in the filenames.
Fig.2 Count of emotions
The study also makes use of the Crowd-sourced Emotional Multimodal Actors Dataset, also known as CREMA-D. The audio recordings in this dataset, performed by actors, depict a variety of emotional states, including sadness, anger, disgust, fear, happiness, and neutrality. Because the emotive speech is elicited through the actors' performances, the dataset adds a further level of authenticity and diversity to the emotional expressions. A variety of emotional states is also represented in the audio samples of the SAVEE collection, including surprise, anger, disgust, fear, neutrality, and sadness. SAVEE consists of spoken-word audio recordings that give access to an extensive array of emotions expressed through vocalization. Audio recordings from the Toronto Emotional Speech Set (TESS) constitute the final component. These recordings cover surprise and other emotional states, and they add further instances and a wider variety of emotional expressions to the overall pool, increasing its size. Combining these datasets yields an extensive and varied collection of emotional speech that spans a wide spectrum of affective states that can be expressed orally. Both the authenticity of the recordings and the breadth of emotional expression contribute to the versatility and reliability of the SER model trained on this compilation and to its capacity to generalize.
To evaluate its reliability, the proposed work is also validated on the Berlin Database of Emotional Speech (EMO-BD)[22] and IEMOCAP[23] datasets. The EMO-BD dataset consists of 535 emotional speech files. It covers anxiety, happiness, neutrality, disgust, sadness, boredom, and anger, was recorded by five male and five female professional speakers, and is widely used by researchers for SER. The IEMOCAP dataset, which consists of five sessions with male and female speakers, is also used to analyze emotions. Excited utterances are merged into the happy category, giving four distinct emotion classes: angry, happy, neutral, and sad. The resulting distribution is 1103 angry, 1636 happy, 1708 neutral, and 1084 sad utterances.
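The IEMOCAP class counts reported above can be turned into class shares with a few lines of arithmetic. The sketch below is only an illustration of that arithmetic; the majority-class baseline it prints is a reference point added here for context and is not a result reported in this paper.

```python
# Class counts for the four-class IEMOCAP setup, as reported in the text.
counts = {"angry": 1103, "happy": 1636, "neutral": 1708, "sad": 1084}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
# This baseline is an illustrative reference, not a figure from the paper.
majority_baseline = max(shares.values())

print(f"total utterances: {total}")
for label, share in shares.items():
    print(f"{label:>8}: {share:.1%}")
print(f"majority-class baseline: {majority_baseline:.1%}")
```

The four classes are moderately imbalanced, which is one reason to report per-class metrics alongside overall accuracy.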
2.2 Features Extracted
To obtain the information needed for emotion detection, a substantial collection of features is extracted from the audio signals. Feature importance in emotion recognition can be analyzed with techniques such as permutation importance, SHAP (SHapley Additive exPlanations) values, and partial dependence plots. Permutation importance assesses each feature's impact by shuffling its values and observing the change in performance. SHAP values offer insight into individual feature contributions to predictions. Partial dependence plots visualize feature-emotion relationships, and statistical tests can additionally be conducted to assess feature significance. Ranking features on the basis of these analyses identifies the most influential ones, and the findings can guide future feature engineering efforts, which may focus on enhancing crucial features or exploring new representations. Cross-validation helps ensure robustness. Results should be clearly documented and reported to facilitate understanding and guide further research. This systematic approach provides insight into feature importance and aids in optimizing emotion recognition models. The extracted features are intended to encapsulate the distinctive qualities of speech signals that communicate nuanced emotional information, and they cover a variety of the constituent elements of the audio data.
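As a minimal sketch of the permutation importance idea described above, the following standard-library example shuffles one feature column at a time and records the average drop in accuracy. The toy data, the `toy_predict` stand-in classifier, and the function names are hypothetical and are not taken from this study; a real analysis would use the trained SER model and held-out data.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (hypothetical): feature 0 decides the label, feature 1 is noise.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def toy_predict(row):
    """Stand-in for a trained classifier; it only looks at feature 0."""
    return int(row[0] > 0.5)

importances = permutation_importance(toy_predict, X, y)
print(importances)
```

In this toy setup, shuffling the informative feature costs a large share of the accuracy, while shuffling the noise feature costs nothing, which is the ranking signal the technique relies on.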
The following are the most significant characteristics that were extracted.
2.2.1 Zero Crossing Rate (ZCR) [24]
ZCR is the rate at which the audio signal changes sign; it can be used to understand the waveform's temporal fluctuations and transitions.
Sign changes indicate rapid shifts in the waveform, which can correlate with abrupt changes in emotional expression such as sudden outbursts, transitions between emotions, or changes in vocal intensity. For example, a higher ZCR may indicate a more dynamic and expressive vocal delivery, which could be associated with emotions like excitement or agitation.
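The definition above can be sketched in a few lines. This is a simplified illustration that counts sign changes between adjacent samples of a single frame; the two example frames are made up, and library implementations may differ in framing, padding, and normalization.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

# Hypothetical frames: one slowly varying, one flipping sign at every sample.
smooth = [1.0, 0.8, 0.5, 0.2, -0.1, -0.4, -0.6, -0.3]
jittery = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]

print(zero_crossing_rate(smooth))   # one crossing out of seven pairs
print(zero_crossing_rate(jittery))  # a crossing at every pair
```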
2.2.2 Chroma_stft
Extracted from the Short-Time Fourier Transform (STFT), this feature captures the spectral content of the audio signal in different musical pitch classes, providing information about tonal qualities and musical content.
Emotions often manifest through changes in tonal qualities, such as pitch variations or musical content. Chroma_stft helps in capturing these musical features, providing insight into the melodic aspects of vocal expression. For instance, shifts in pitch or melody can convey emotions such as joy, sadness, or tension.
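To illustrate what "pitch classes" means here, the sketch below folds a handful of spectral peaks into 12 bins, one per semitone name, so that octaves of the same note land in the same bin. It assumes equal temperament with A4 = 440 Hz, and the peak list is made up; an actual chroma feature is computed from every STFT frame with a filter bank rather than from isolated peaks.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Index (0 = C) of the nearest equal-tempered pitch class."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi % 12

def chroma_vector(peaks):
    """Sum spectral energy into 12 pitch-class bins; peaks are (Hz, energy)."""
    chroma = [0.0] * 12
    for freq_hz, energy in peaks:
        chroma[pitch_class(freq_hz)] += energy
    return chroma

# Hypothetical spectral peaks: three octaves of A plus one C.
peaks = [(220.0, 1.0), (440.0, 0.5), (880.0, 0.25), (261.63, 0.4)]
chroma = chroma_vector(peaks)

for name, energy in zip(PITCH_CLASSES, chroma):
    if energy:
        print(name, energy)
```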
2.2.3 Mel Frequency Cepstral Coefficients (MFCCs) [25]
MFCCs are a depiction of the audio signal's spectral characteristics that focus primarily on the frequency ranges that are perceptible to humans. Similar to how the human hearing system responds to environmental sounds, MFCCs can detect important frequency components.
MFCCs are particularly effective in capturing the nuanced variations in vocal timbre and texture that are indicative of different emotional states. They help in capturing the subtle differences in vocal tone, resonance, and articulation that accompany various emotions. For example, changes in the distribution of MFCCs may reflect variations in vocal tension, which could correspond to emotions such as anger or fear.
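The perceptual weighting behind MFCCs comes from the mel scale. The sketch below uses one common convention for the Hz-to-mel mapping (libraries may use a different variant) to show that bands equally spaced in mel are narrow at low frequencies and wide at high frequencies. The band count and the 8000 Hz upper limit are illustrative assumptions, not settings from this study.

```python
import math

def hz_to_mel(freq_hz):
    """Convert Hz to mel using the formula 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min=0.0, f_max=8000.0):
    """Band edges in Hz that are equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / n_bands
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 1)]

edges = mel_band_edges(8)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:8.1f} Hz to {hi:8.1f} Hz")
```

MFCCs are then typically obtained by taking the logarithm of the energy in such bands and applying a discrete cosine transform.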
2.2.4 Root Mean Square (RMS) value[26]
The audio signal's total amplitude or energy is represented by the Root-Mean-Square (RMS) value, which also provides information about the signal’s loudness or intensity.
Emotions often manifest through variations in vocal intensity, ranging from whispers to loud exclamations. RMS helps in quantifying these variations in loudness, which can be indicative of emotional arousal or expressiveness. For instance, higher RMS values may indicate heightened emotional intensity, potentially corresponding to emotions such as anger or excitement.
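A frame-wise RMS computation can be sketched as follows. The frame length, hop length, and the quiet and loud example segments are illustrative choices and do not reflect the settings used in this study.

```python
import math

def rms(frame):
    """Root mean square of the samples in one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def frame_rms(signal, frame_length, hop_length):
    """RMS of each full frame, moving hop_length samples at a time."""
    starts = range(0, len(signal) - frame_length + 1, hop_length)
    return [rms(signal[s:s + frame_length]) for s in starts]

# Hypothetical signal: a quiet segment followed by a loud one.
quiet = [0.1, -0.1] * 4
loud = [0.8, -0.8] * 4
values = frame_rms(quiet + loud, frame_length=8, hop_length=8)

print(values)  # the second frame has the larger RMS
```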
2.2.5 Mel-Spectrogram[27]
This feature highlights the distribution of spectral energy by providing a visual representation of the frequencies across time, with the frequency axis mapped to the mel scale. Figs.3 and 4 show the spectrograms for audio with sad and happy emotions, respectively, and Figs.5 and 6 show the spectrograms for audio with fear and angry emotions, respectively. In emotion recognition tasks using voice signals, each of the features described above plays a crucial role in capturing a different aspect of the audio signal related to emotional expression.
Fig.3 Spectrogram for audio with sad emotion
Fig.4 Spectrogram for audio with happy emotion
Fig.5 Spectrogram for audio with fear emotion
Fig.6 Spectrogram for audio with angry emotion
Mel-spectrogram offers a comprehensive overview of the spectral characteristics of the audio signal, capturing both temporal and frequency dynamics. It helps in identifying patterns and structures in the audio signal that correspond to different emotional expressions. For example, distinct patterns in the mel-spectrogram may correspond to specific emotional states, such as the presence of high-frequency energy bursts associated with fear or the presence of lower-frequency components associated with sadness. When these features are merged, the audio signals are represented in a diverse manner. This representation encompasses temporal and spectral elements, in addition to aspects related to intensity. They enable the model to identify and classify emotions more effectively since they have essential qualities that represent emotions that are expressed through speech. Furthermore, the dataset is improved through the application of augmentation techniques such as stretching, pitch modulation, and noise injection. These methods strengthen the model's ability to handle variations and enhance its generalization capabilities.
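Two of the augmentation ideas mentioned above can be sketched with the standard library alone. The noise level, the rate, and the test tone below are hypothetical values chosen for illustration. Note that `change_speed` is a deliberately naive resampler that changes duration and pitch together; dedicated time-stretching and pitch-shifting routines in audio libraries treat the two separately, and this sketch does not reproduce them.

```python
import math
import random

def add_noise(signal, noise_level=0.02, seed=None):
    """Add Gaussian noise whose scale is a fraction of the peak amplitude."""
    rng = random.Random(seed)
    peak = max(abs(x) for x in signal)
    return [x + rng.gauss(0.0, noise_level * peak) for x in signal]

def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 shortens the signal.

    Naive version: duration and pitch change together.
    """
    n_out = max(2, int(round(len(signal) / rate)))
    out = []
    for i in range(n_out):
        pos = i * (len(signal) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out

# Hypothetical input: 100 samples of a sine tone with five cycles.
tone = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
noisy = add_noise(tone, noise_level=0.02, seed=0)
faster = change_speed(tone, rate=2.0)

print(len(tone), len(noisy), len(faster))
```

Each augmented copy keeps the original emotion label, so the training set grows without new recordings.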
2.3 Model Employed
The main model for emotion categorization based on auditory inputs is a CNN architecture. The goal of the CNN model is to process the input audio and learn features from it automatically. Its hierarchical architecture makes it easier to find patterns and connections among the extracted characteristics. The CNN architecture is composed of numerous layers designed to extract and abstract, from the audio spectrograms, the information pertinent to the problem at hand. It typically consists of the following layers.
1) Convolutional layers: By applying filters to the input spectrogram, these layers can identify a wide variety of characteristics and patterns included within the audio signals.
2) Max-pooling layers: After the convolutional layers, the max-pooling layers downsample the learned features, thereby lowering the dimensionality and concentrating on the most significant information.
3) Dropout layers: There are dropout layers that are incorporated to prevent overfitting. These layers randomly deactivate a portion of the neurons during training, which encourages the model to generalize more effectively.
4) Fully connected layers: These layers combine the extracted features and patterns and use them for the final classification into the emotion categories.
The CNN architecture used in this work consists of several convolutional blocks, each of which includes convolutional, pooling, and in some cases dropout layers, followed by dense layers for classification. For multi-class classification, it is common practice to use activation functions such as ReLU (Rectified Linear Unit) in the convolutional layers and softmax in the final output layer.
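The role of the softmax output layer can be illustrated with a small standard-library sketch, paired here with the categorical cross-entropy loss that is commonly used with such outputs. The eight logits and the true class index are made-up values, not outputs of the model in this study.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

# Hypothetical scores for eight emotion classes; class 2 is the true label.
logits = [0.2, -1.0, 3.1, 0.5, -0.3, 1.2, 0.0, -2.0]
one_hot = [0, 0, 1, 0, 0, 0, 0, 0]

probs = softmax(logits)
loss = categorical_cross_entropy(one_hot, probs)

print([round(p, 3) for p in probs])
print(round(loss, 4))
```

The loss is small when the network puts most of its probability on the true class and grows as that probability shrinks; an uninformative prediction over eight classes gives a loss of ln 8.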
The proposed CNN model consists of a sequential stack of several layers with the following parameters.
1) Input layer: The input shape is set by the ‘input_shape’ parameter, which is ‘(x_train.shape[1], 1)’. This means that each input is a one-dimensional sequence with a single channel.
2) Convolutional layers: Four convolutional layers are added sequentially. The first convolutional layer has 256 filters with a kernel size of 5, and the subsequent layers have 256, 128, and 64 filters, respectively. The activation function used in each convolutional layer is ReLU, which introduces non-linearity into the model and helps in capturing complex patterns in the data. Padding is set to ‘same’, which ensures that the output length of each convolution equals its input length.
3) Max pooling layers: Each convolutional layer is followed by a max-pooling layer. Max-pooling downsamples the feature maps, reduces computational complexity, and helps the model focus on the most important features. Each max-pooling layer has a pool size of 5 and a stride of 2, which means it takes the maximum value within a window of size 5 and moves 2 steps at a time. Padding is set to ‘same’, so each pooling layer outputs the input length divided by the stride, rounded up.
4) Dropout layers: Two dropout layers are added to prevent overfitting. Dropout randomly sets a fraction of input units to zero during training, which helps in reducing overfitting by preventing the network from relying too much on specific activations.
5) Flatten layer: This layer flattens the output of the previous layer into a one-dimensional array, which is required before passing it to the fully connected layers.
6) Dense layers: There are two fully connected dense layers. The first dense layer has 32 units and utilizes the ReLU activation function. The second dense layer has 8 units with a softmax activation function, which is used for multi-class classification tasks. Softmax normalizes the output into a probability distribution over the 8 classes.
7) Compilation: The model is compiled with the Adam optimizer, categorical cross-entropy loss function (suitable for multi-class classification), and accuracy as the evaluation metric.
8) Callbacks: A ReduceLROnPlateau callback is used to reduce the learning rate when the training loss plateaus. It monitors the loss, and if the loss does not decrease for a certain number of epochs (patience), it reduces the learning rate by a factor specified (0.4 in this case) until it reaches the minimum specified learning rate (0.0000001).
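The effect of the layer stack listed above on the tensor shape can be traced with simple arithmetic. The sketch below assumes stride-1 convolutions with ‘same’ padding (which keep the length) and ‘same’-padded pooling with stride 2 (which gives the length divided by 2, rounded up). The input length of 162 is a hypothetical value, since the text only gives it as ‘x_train.shape[1]’.

```python
import math

def same_pool_length(length, stride):
    """Output length of a pooling layer with 'same' padding."""
    return math.ceil(length / stride)

def trace_shapes(input_length, filters=(256, 256, 128, 64), pool_stride=2):
    """(length, channels) after each convolution + max-pooling block."""
    shapes = []
    length = input_length
    for n_filters in filters:
        # A 'same'-padded, stride-1 convolution leaves the length unchanged,
        # so only the pooling step shortens the sequence.
        length = same_pool_length(length, pool_stride)
        shapes.append((length, n_filters))
    return shapes

INPUT_LENGTH = 162  # hypothetical feature-vector length, not given in the text

shapes = trace_shapes(INPUT_LENGTH)
flat_units = shapes[-1][0] * shapes[-1][1]  # size after the flatten layer

for block, (length, channels) in enumerate(shapes, start=1):
    print(f"block {block}: length={length}, channels={channels}")
print(f"flatten: {flat_units} units feeding the 32-unit dense layer")
```

Under these assumptions the sequence is roughly halved four times before flattening, which keeps the first dense layer small.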
Table1 shows the different layers used in the model. Overall, this model architecture leverages convolutional layers for feature extraction from sequential data, max-pooling layers for down-sampling, dropout layers for regularization, and fully connected layers for classification. The ReduceLROnPlateau callback adjusts the learning rate dynamically during training and, in conjunction with the categorical cross-entropy loss function and the Adam optimizer, helps train the model effectively and optimize its performance in recognizing emotions from the audio features.
Table1 Architecture of the CNN model used
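The learning-rate adjustment described above can be simulated without any deep learning library. The sketch below is a simplified model of a reduce-on-plateau rule using the factor (0.4) and minimum learning rate (0.0000001) stated in the text. The patience of 2, the initial learning rate of 0.001, and the loss values are assumptions made for illustration, and the real callback has further options (such as a minimum improvement threshold and a cooldown) that are omitted here.

```python
def simulate_reduce_lr(losses, initial_lr=1e-3, factor=0.4,
                       patience=2, min_lr=1e-7):
    """Learning rate in effect after each epoch under a plateau rule."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # epochs since the monitored loss last improved
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical training losses with two plateaus.
losses = [1.20, 0.90, 0.70, 0.71, 0.72, 0.60, 0.61, 0.62, 0.63, 0.64]
schedule = simulate_reduce_lr(losses)

for epoch, (loss, lr) in enumerate(zip(losses, schedule), start=1):
    print(f"epoch {epoch:2d}: loss={loss:.2f}, lr={lr:.2e}")
```

Each time the loss fails to improve for two consecutive epochs the rate is multiplied by 0.4, and it never drops below the configured floor.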
3 Results and Discussion
Fig.7 shows the training and testing loss and accuracy. The SER model produced encouraging results in identifying and categorizing emotions from speech signals. On evaluation, the model achieved an overall accuracy of 94.57% on the test dataset. Classification performance varied notably among the emotional categories: some emotions, such as anger and surprise, were classified more accurately than others. This variation is in line with the intrinsic complexity of differentiating between distinct emotional expressions in speech signals.
The model showed a higher degree of accuracy in recognizing surprise and anger. This could be explained by the distinct acoustic qualities associated with these emotional states, such as changes in pitch, tone, and intensity. Nevertheless, differentiating discrete emotions was complicated by factors such as neutrality and subtle variations in emotional expression. Because some emotions share more acoustic characteristics than others, they can be more difficult to distinguish from the audio alone.
Fig.7 Training and testing results for number of epochs with the loss and accuracy representations
The confusion matrix between predicted and actual labels is illustrated in Fig.8. The model's ability to accurately discern emotions across a variety of speech instances indicates that it generalizes well. Nevertheless, the fact that accuracy for some emotions remains only moderate suggests that there is room for improvement. To enhance the model's ability to differentiate between emotional states, it may be necessary to refine the feature extraction methods, investigate more complex network architectures, or consider novel augmentation techniques.
Fig.8 Confusion matrix of the proposed work
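Per-class behavior of the kind discussed above can be read off a confusion matrix with a short calculation. The sketch below assumes rows are actual labels and columns are predicted labels, and the 3-class matrix is made up for illustration; it is not the matrix in Fig.8.

```python
def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

def per_class_metrics(cm):
    """(precision, recall, F1) per class; rows actual, columns predicted."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])
        predicted = sum(cm[r][k] for r in range(n))
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Hypothetical counts for three emotion classes.
cm = [
    [45, 3, 2],
    [4, 40, 6],
    [1, 5, 44],
]

metrics = per_class_metrics(cm)
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)

print(f"accuracy: {accuracy(cm):.3f}")
print(f"macro F1: {macro_f1:.3f}")
```

Comparing per-class recall values makes it clear which emotions are confused with one another, which overall accuracy alone hides.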
In addition, incorporating additional modalities or contextual information could improve the model's ability to discern intricate emotional signals within speech, because the current model does not consider the context of the speech signals. Further research in speech emotion recognition is warranted and has the potential to yield substantial advances. Beyond the CNNs used in this research, more complex neural network topologies such as transformer-based models or recurrent neural networks could be investigated to improve the model's comprehension of the affective indicators in speech signals. More sophisticated techniques for feature extraction and selection could further improve the model's ability to identify subtle fluctuations in emotional state. Multimodal fusion techniques that integrate other modalities, such as facial expressions or physiological data, could also enhance the model's comprehension of emotions. Combining datasets that span a wide range of ethnicities, languages, and cultural backgrounds would increase the model's inclusiveness and applicability. To enhance reproducibility and applicability, specific details of hyperparameter optimization should be provided, including the hyperparameters tuned (such as learning rate, batch size, and regularization strength) and their ranges or distributions. The optimization method used, such as grid search, random search, or Bayesian optimization, plays a crucial role in selecting the best hyperparameters for model performance. Performance metrics such as accuracy and F1 score are shown in Table2; any trade-offs or interactions observed between hyperparameters should also be reported.
With the hyperparameter tuning process documented, other researchers can replicate and adapt the methodology effectively, which fosters transparency and facilitates further advances in the field of emotion recognition. Notwithstanding the advances achieved, some challenges remain. These include the constraints of the datasets, the subjectivity of emotion classification, and the difficulty of differentiating between closely related emotions. Tackling these problems by applying improved techniques and gaining a deeper understanding of the emotional cues contained in speech signals will aid the advancement of SER and unlock its potential in numerous field applications.
Table2 compares the proposed approach, which employs deep learning with a CNN, against several previous studies. The evaluation metrics are accuracy, F1 score, and processing time. The proposed method obtains an accuracy of 94.57% and an F1 score of 0.947, with a processing time of 12 ms. By contrast, Krishna et al.[6] employed SVM and MLP with diverse features, reaching an accuracy of 86.50% and an F1 score of 0.850 with a processing time of 20 ms. Mittal et al.[11] combined SVM, LSTM, Random Forests, and CNNs, resulting in a processing time of 25 ms, an accuracy of 70.00%, and an F1 score of 0.680. Babu et al.[14] employed Librosa for deep learning and achieved an accuracy of 81.82%, an F1 score of 0.800, and a processing time of 15 ms. Cherif et al.[17] employed machine learning models, CNNs, and LSTM and achieved an accuracy of 93.34%, an F1 score of 0.920, and a processing time of 22 ms. Yoon et al.[18] utilized a deep dual recurrent encoder model, achieving a processing time of 30 ms, an accuracy of 68.80%, and an F1 score of 0.670. Mountzouris et al.[21] employed DBN, DNN, LSTM, LSTM-ATN, CNN, and CNN-ATN; their CNN-ATN model achieved an accuracy of 74.00% on SAVEE and 77.00% on RAVDESS, an F1 score of 0.726, and a processing time of 28 ms. As shown in the table, the proposed method achieves the highest accuracy and F1 score and the shortest processing time among the compared methods.
Table2 Comparative results of proposed methodology with different state-of-the-art networks
The proposed research presents a comprehensive investigation into SER utilizing deep learning techniques across multiple datasets. Our study primarily focuses on the utilization of CNNs for sound feature extraction, augmented by various techniques such as stretching, pitch manipulation, and noise injection to enhance data quality. We aim to provide an in-depth analysis of these methods and their impact on SER performance which is shown in Table3. In the proposed research, a CNN-based feature extraction method with data augmentation techniques achieved impressive performance with 95.67% accuracy on the EMO-BD dataset and 84.21% on the IEMOCAP dataset. The method proposed by Ref. [28] combined SSL models and spectral features through MoE, achieving a weighted accuracy of 73.91% and an unweighted accuracy of 72.29%, addressing the domain shift problem in SER. Similarly, Ref. [29] utilized multi-resolution variational mode decomposition, achieving 90.51% accuracy, while facing challenges like dataset dependency and potential overfitting. Additionally, Ref. [30] introduced a fusion of spectral and temporal features using CNNs and a convolution layer-based transformer, obtaining 94.20% accuracy on EMO-BD and 81.10% on IEMOCAP datasets, highlighting computational complexity as a challenge. However, the proposed research stands out due to its superior performance, achieving notably higher accuracy rates on both datasets, surpassing the limitations encountered in the other studies, and demonstrating the effectiveness of the CNN-based feature extraction method with data augmentation techniques in enhancing SER.
Table3 Comparison of different methods with the proposed method on the EMO-BD and IEMOCAP datasets
4 Conclusions
Convolutional neural networks were applied to speech emotion recognition with considerable success, allowing emotions to be identified and categorized from voice input. Using several datasets that covered a wide range of emotional expressions made it easier to develop a strong model that could identify a range of emotional states from audio input. The features extracted from the audio signals were MFCCs, RMS values, zero crossing rate, Chroma_stft, and mel-spectrograms. Together, these features provided a comprehensive description of the audio signals and captured characteristics linked to a variety of emotions. Pitch modulation, stretching, and noise injection were among the augmentation techniques implemented to increase the diversity of the dataset and the model's capacity to generalize across a broad spectrum of emotional expressions. Despite attaining an accuracy of 94.57% on the test dataset, the model showed substantial variability in its performance across emotional categories. The proposed work also attained accuracies of 95.67% and 84.21% when validated on the EMO-BD and IEMOCAP datasets, respectively. Even with this high overall accuracy, the model classifies certain emotions more reliably than others. This shows the need for further refinement of the model's discriminative capabilities, particularly in discerning subtle or closely related emotional states.
To augment the precision of emotion identification, prospective domains of research might encompass the examination of improved techniques for feature extraction, alternative architectures for networks, or multimodal methodologies that integrate contextual comprehension. Furthermore, the integration of more extensive and varied datasets, along with the enhancement of augmentation techniques, may potentially augment the model's capacity to understand and classify intricate emotional expressions conveyed in speech signals. Overall, the SER model represents a positive advancement in the direction of automated emotion recognition from speech. To more accurately capture the intricacies of human emotional indicators conveyed via audio signals, ongoing advancements and improvements are required.