Improved MFCC features and TWM model for speech emotion recognition

Liyan Zhang; Jiaxin Du; Shuang Chen; Jiayan Li

Supervised by: Ministry of Industry and Information Technology of the People's Republic of China. Sponsored by: Harbin Institute of Technology. Editor-in-chief: Yu Zhou. ISSN 1005-9113. CN 23-1378/T.

Author Name	Affiliation	Postcode
Liyan Zhang^*	School of Railway Intelligent Engineering, Dalian Jiaotong University, Dalian 116028, Liaoning, China	116028
Jiaxin Du	School of Railway Intelligent Engineering, Dalian Jiaotong University, Dalian 116028, Liaoning, China
Shuang Chen	School of Railway Intelligent Engineering, Dalian Jiaotong University, Dalian 116028, Liaoning, China
Jiayan Li	School of Railway Intelligent Engineering, Dalian Jiaotong University, Dalian 116028, Liaoning, China

Abstract:

Traditional Mel-frequency cepstral coefficient (MFCC) features cannot fully represent the dynamic characteristics of speech. To address this, the paper applies first-order and second-order differencing to static MFCC features to extract dynamic MFCC features, and builds a hybrid model (TWM) on top of the TIM-NET network that combines a multi-head attention mechanism with an improved Wasserstein generative adversarial network (WGAN-GP). The multi-head attention mechanism effectively prevents vanishing gradients and allows deeper networks to be built that capture long-range dependencies and learn from information at different time steps, which improves model accuracy. WGAN-GP addresses the problem of insufficient sample size by improving the quality of the generated speech samples. Experimental results show that the method significantly improves the accuracy and robustness of speech emotion recognition on the RAVDESS and EMO-DB datasets.
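As an illustration of the dynamic features mentioned in the abstract, the sketch below appends first-order (delta) and second-order (delta-delta) differences to a matrix of static MFCC frames. It is a minimal sketch, not the authors' implementation: the abstract does not give the exact differencing scheme, so the code assumes the commonly used regression formula over a window of ±2 frames with edge frames repeated at the boundaries. The function names `delta` and `add_dynamic_features` and the `width` parameter are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
def delta(frames, width=2):
    """Regression-based difference of a feature sequence along the time axis.

    frames: list of frames, each a list of coefficients (n_frames x n_coeffs).
    width:  number of neighbouring frames on each side (assumed value: 2).
    """
    n_frames = len(frames)
    # Normalising constant of the regression formula: 2 * sum(n^2)
    denom = 2 * sum(n * n for n in range(1, width + 1))

    def frame_at(t):
        # Repeat the first/last frame beyond the boundaries (edge padding).
        return frames[min(max(t, 0), n_frames - 1)]

    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            num = sum(
                n * (frame_at(t + n)[k] - frame_at(t - n)[k])
                for n in range(1, width + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Concatenate static, first-order and second-order features per frame."""
    d1 = delta(static)   # first-order difference (delta)
    d2 = delta(d1)       # second-order difference (delta-delta)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With 13 static coefficients per frame, for example, each output frame would hold 39 values: the static coefficients followed by their first-order and second-order differences.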

Key words: dynamic features; speech emotion recognition; multi-head attention mechanism; generative adversarial networks

DOI: 10.11916/j.issn.1005-9113.24051

CLC Number: TP183

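Description in Chinese (translated):

To address the problem that traditional MFCC features cannot fully represent dynamic speech characteristics, this paper introduces first-order and second-order differences on top of static MFCC features to extract dynamic MFCC features, and builds, on the basis of the TIM-NET network, a hybrid model (TWM) that combines a multi-head attention mechanism with an improved Wasserstein generative adversarial network (WGAN-GP). WGAN-GP addresses the problem of insufficient sample size by improving the quality of the generated speech samples. Experimental results show that the method significantly improves the accuracy and robustness of speech emotion recognition on the RAVDESS and EMO-DB datasets.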
