Existing methods for indexing and searching audio documents are generally based on text metadata and text-based search engines, but this approach is often problematic and time-consuming because text labels do not necessarily describe the audio content. Query by Example (QBE) is an alternative approach for improving the effectiveness and efficiency of sound retrieval. In this research, the authors propose a novel approach to sound query by vocal imitation. Vocal imitation is commonly used in human communication and can be employed for human-computer interaction. Two systems are proposed: (1) a supervised system that trains a multiclass classifier on vocal imitations of the different sound classes in the library and classifies a new imitation query into one of those classes; (2) an unsupervised system that is more flexible because it measures the feature distance between the imitation query and each sound in the library, returning the sounds most similar to the query. Both systems require an effective feature representation of imitation queries and of the sounds in the library. Existing handcrafted audio features may not work well given the variety of vocal imitations and the mismatch between vocal imitations and actual sounds. The authors therefore propose to learn feature representations from training vocal imitations automatically using a Stacked Auto-Encoder (SAE). Experiments show that retrieval performance with automatically learned features outperforms that with carefully handcrafted features in both supervised and unsupervised settings.
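The unsupervised system described above can be illustrated with a minimal sketch: given fixed-length feature vectors for the query and for each library sound (assumed here to have been extracted already, e.g. by a trained SAE encoder), library sounds are ranked by their distance to the query. The function names, the use of cosine distance, and the toy random features below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two feature vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query_feat, library_feats, top_k=3):
    """Rank library sounds by feature distance to the imitation query;
    return indices of the top_k most similar sounds (hypothetical helper)."""
    dists = [cosine_distance(query_feat, f) for f in library_feats]
    return np.argsort(dists)[:top_k]

# Toy example: 4 library sounds with 8-dimensional feature vectors.
rng = np.random.default_rng(0)
library = rng.normal(size=(4, 8))
# A query that is a slightly perturbed copy of library sound 2,
# standing in for an imitation of that sound.
query = library[2] + 0.05 * rng.normal(size=8)
print(retrieve(query, library, top_k=2))  # sound 2 should rank first
```

Any vector distance could be substituted here; the key point from the abstract is that no class labels are needed, which is what makes the unsupervised system more flexible than the classifier-based one.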
Zhang, Yichi; Duan, Zhiyao
Affiliation: Dept. of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
JAES Volume 64 Issue 7/8 pp. 533-543; July 2016
Publication Date: August 11, 2016