The aim of this study was to evaluate the suitability of 2D audio signal feature maps for speech recognition based on deep learning. The proposed methodology employs a convolutional neural network (CNN), which is a class of deep, feed-forward artificial neural network. The authors analyzed the audio signal feature maps, namely spectrograms, linear and Mel-scale cepstrograms, and chromagrams. This choice was made because CNN performs well in 2D data-oriented processing contexts. Feature maps were employed in a Lithuanian word-recognition task. The spectral analysis led to the highest word recognition rate. Spectral and mel-scale cepstral feature spaces outperform linear cepstra and chroma. The 111-word classification experiment depicts f1 score of 0.99 for spectrum, 0.91 for mel-scale cepstrum , 0.76 for chromagram, and 0.64 for cepstrum feature space on test data set.
Korvel, Gražina; Treigys, Povilas; Tamulevicus, Gintautas; Bernataviciene, Jolita; Kostek, Bozena
Affiliations: Institute of Data Science and Digital Technologies, Vilnius University, Vilnius, Lithuania; Audio Acoustics Laboratory, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdansk, Poland(See document for exact affiliation information.)
JAES Volume 66 Issue 12 pp. 1072-1081; December 2018
Publication Date: December 20, 2018
No AES members have commented on this report yet.
If you are not yet an AES member and have something important to say about this report then we urge you to join the AES today and make your voice heard. You can join online today by clicking here.