Continuous speech separation (CSS) is a recently proposed framework that aims to separate each speaker from an input mixture signal in a streaming fashion. We present an evaluation study of practical design considerations for a CSS system, addressing important aspects that recent works have neglected. In particular, we focus on the trade-off between separation performance, computational requirements and output latency, showing how an offline separation algorithm can be used to perform CSS with a desired latency. We carry out an extensive analysis of the choice of CSS processing window size and hop size on sparsely overlapped data. We find that the best trade-off between computational burden and performance is obtained for a window of 5 s.
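To make the window size and hop size discussed above concrete, the following is a minimal sketch of how a stream could be cut into overlapping processing windows. It is not the authors' implementation: the function names `css_windows` and `overlap_factor` are hypothetical, and the 2.5 s hop in the usage note is an assumed value chosen only for illustration (the abstract reports only the 5 s window).

```python
def css_windows(total_s, window_s, hop_s):
    """Return (start, end) times in seconds of sliding windows over a stream.

    Hypothetical helper for illustration; the last window is truncated
    at the end of the stream.
    """
    if window_s <= 0 or hop_s <= 0 or hop_s > window_s:
        raise ValueError("require 0 < hop_s <= window_s")
    windows = []
    i = 0
    while True:
        start = i * hop_s  # multiply instead of accumulating to limit float drift
        if start >= total_s:
            break
        end = min(start + window_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        i += 1
    return windows


def overlap_factor(window_s, hop_s):
    """Roughly how many windows cover each sample of the stream.

    Under the assumption that every window is passed once through the
    separation model, this ratio is a simple proxy for how compute grows
    as the hop shrinks relative to the window.
    """
    return window_s / hop_s
```

For example, `css_windows(10.0, 5.0, 2.5)` yields three windows starting at 0 s, 2.5 s and 5 s, and `overlap_factor(5.0, 2.5)` is 2.0, meaning each sample is processed about twice under this simplified model.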
Authors:
Morrone, Giovanni; Cornell, Samuele; Zovato, Enrico; Brutti, Alessio; Squartini, Stefano
Affiliations:
Università Politecnica delle Marche, Ancona, Italy; PerVoice S.p.A., Trento, Italy; Fondazione Bruno Kessler, Trento, Italy (See document for exact affiliation information.)
AES Convention:
152 (May 2022)
Paper Number:
10562
Publication Date:
May 2, 2022
Subject:
Television Audio
This paper is Open Access, which means you can download it for free.