AES Journal Forum

Analysis of 2D Feature Spaces for Deep Learning-Based Speech Recognition


The aim of this study was to evaluate the suitability of 2D audio-signal feature maps for deep learning-based speech recognition. The proposed methodology employs a convolutional neural network (CNN), a class of deep, feed-forward artificial neural networks, chosen because CNNs perform well on 2D data. The authors analyzed four feature maps: spectrograms, linear- and Mel-scale cepstrograms, and chromagrams, applying them to a Lithuanian word-recognition task. The spectrogram yielded the highest word-recognition rate, and the spectral and Mel-scale cepstral feature spaces outperformed the linear-cepstral and chroma ones. In a 111-word classification experiment, the F1 scores on the test set were 0.99 for the spectrogram, 0.91 for the Mel-scale cepstrogram, 0.76 for the chromagram, and 0.64 for the linear cepstrogram.
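As an illustration of the kind of 2D feature map the study feeds to a CNN, the sketch below computes a log-magnitude spectrogram with SciPy. This is a minimal, hypothetical example: the synthetic tone stands in for a speech recording, and the window/overlap parameters are placeholder choices, not the ones used in the paper.

```python
import numpy as np
from scipy import signal

fs = 16000                               # assumed sample rate (Hz)
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t)          # synthetic tone standing in for a speech sample

# Short-time Fourier analysis: each column is one analysis frame,
# each row one frequency bin, giving a 2D map suitable for a CNN input.
f, frames, Sxx = signal.spectrogram(x, fs=fs, nperseg=512, noverlap=256)
log_spec = 10 * np.log10(Sxx + 1e-10)    # log scale compresses the dynamic range

print(log_spec.shape)                    # (frequency bins, time frames)
```

Mel-scale cepstrograms and chromagrams can be derived from the same short-time spectra by applying a Mel filter bank (plus a DCT) or by folding bins onto the 12 pitch classes, respectively.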

JAES Volume 66 Issue 12 pp. 1072-1081; December 2018


