In this paper, we propose a new input representation for a Convolutional Neural Network, with the goal of detecting music structure boundaries. For this task, previous works used a late fusion of a Mel-scaled Log-magnitude Spectrogram (MLS) network and a lag-matrix network. We propose instead to use several self-similarity matrices, each representing a different audio descriptor, combined along the depth of the input layer. We show that this representation improves the results over the use of the lag matrix. We also show that using the depth of the input layer provides a convenient way to perform early fusion of representations.
Cohen-Hadria, Alice; Peeters, Geoffroy
Affiliation: IRCAM, Paris, France
AES Conference: 2017 AES International Conference on Semantic Audio (June 2017)
Paper Number: 5-3
Publication Date: June 13, 2017
Subject: Deep Learning
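
The early-fusion input described in the abstract, several self-similarity matrices stacked along the depth of the input layer, can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity measure, the two descriptor names, the random placeholder features, and the helper names `self_similarity` and `stack_ssms` are all assumptions made for illustration only.

```python
import numpy as np


def self_similarity(features):
    """Cosine self-similarity matrix of a (frames, dims) feature sequence.

    Cosine similarity is an assumption; the abstract does not name a measure.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero-norm frames
    return unit @ unit.T


def stack_ssms(feature_sets):
    """Stack one self-similarity matrix per descriptor along the depth axis."""
    ssms = [self_similarity(f) for f in feature_sets]
    return np.stack(ssms, axis=-1)  # shape: (frames, frames, n_descriptors)


# Placeholder descriptors (random data standing in for real audio features).
rng = np.random.default_rng(0)
n_frames = 64
timbre_like = rng.standard_normal((n_frames, 13))  # hypothetical timbre descriptor
chroma_like = rng.random((n_frames, 12))           # hypothetical harmonic descriptor

cnn_input = stack_ssms([timbre_like, chroma_like])
print(cnn_input.shape)
```

Each descriptor contributes one channel, much as the colour planes of an image do, so the first convolutional layer can combine the descriptors directly rather than merging the outputs of separate networks at a later stage.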