AES Journal Forum

Information Extraction and Noisy Feature Pruning for Mandarin Speech Recognition

The Transformer network has two drawbacks in Automatic Speech Recognition (ASR) tasks. First, it focuses mainly on global features and neglects other useful features, such as local features. Second, it is not robust to noisy audio signals. To improve model performance in ASR tasks, the main concerns are useful information extraction and noise removal. First, an information extraction (IE) module is proposed to extract local context information from the integration of previous layers, which contain both low-level and high-level information. Second, a noisy feature pruning (NFP) module is proposed to mitigate the negative effect of noisy audio. Finally, a network called EPT-Net is proposed by integrating the IE module, the NFP module, and the Transformer network. Empirical evaluations were conducted on two widely used Chinese Mandarin datasets, Aishell-1 and HKUST. Experimental results validate the effectiveness of EPT-Net, which achieves character error rates (CER) of 5.3%/5.6% on the Aishell-1 dev/test sets and 21.9% on the HKUST dev set.
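To make the two proposed components concrete, the sketch below illustrates the general idea behind each: fusing the outputs of previous layers and extracting local context over time (IE module), and suppressing frames dominated by noise (NFP module). This is a minimal, illustrative NumPy sketch, not the paper's actual implementation; the function names, the mean-based layer fusion, the moving-average stand-in for a local convolution, and the energy-threshold pruning rule are all assumptions made for illustration.

```python
import numpy as np

def ie_module(layer_outputs, kernel=3):
    """Illustrative IE module: fuse previous layers and extract local context.

    layer_outputs: list of (T, d) arrays from earlier network layers,
    carrying both low-level and high-level information.
    """
    # Fuse previous layers' outputs (simple mean is an assumption here).
    fused = np.mean(np.stack(layer_outputs), axis=0)  # shape (T, d)
    # Local context via a moving average over time, standing in for a
    # learned local convolution.
    pad = kernel // 2
    padded = np.pad(fused, ((pad, pad), (0, 0)), mode="edge")
    local = np.stack([padded[t:t + kernel].mean(axis=0)
                      for t in range(fused.shape[0])])
    return local  # shape (T, d)

def nfp_module(features, threshold=0.1):
    """Illustrative NFP module: prune frames likely dominated by noise.

    Zeroes out time frames whose feature energy falls below a fraction of
    the maximum frame energy (the pruning criterion is an assumption).
    """
    energy = np.linalg.norm(features, axis=1)          # per-frame energy
    mask = (energy >= threshold * energy.max()).astype(features.dtype)
    return features * mask[:, None]                    # shape preserved
```

In the paper's EPT-Net, modules of this kind are integrated with the Transformer encoder; the sketch only conveys the shape-preserving fuse-then-filter pattern the abstract describes.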

JAES Volume 72 Issue 1/2 pp. 59-70; January 2024
No AES members have commented on this paper yet.

AES - Audio Engineering Society