In the classic signal-processing context, identifying and resolving acoustic objects from a compact set of directional microphones is a challenging problem. A practical example is developing a robust system for understanding voice activity in a reverberant conference room from a small number of co-incident directional microphones. In an application setting, many assumptions of the classic academic problem formulation are violated: the actual problem is inherently broadband with a wide dynamic range, simultaneous voice activity, and multi-path acoustic responses leading to source correlation and ambiguity. Room and occupant noise is rarely stationary, and irrelevant acoustic events are not easily separated from voice. There is, however, a useful set of assumptions that can be exploited. While these can be difficult to specify formally, they correspond to the understanding, common sense, and constraints of a real meeting environment. The higher-order statistical independence of typical acoustic scenes and voice activity can be used to gather information selectively in time. The system discussed in this work combines a simple statistical framework, physical source-object modeling, and operational heuristics to decompose a meeting scene with low latency from an array of three co-incident directional microphones. An overview of the system architecture is presented, with specific details of the raw features, a convenient mapping used for clustering, and heuristics over several time scales driven by a voice activity classifier. Longer time frames and suitable constraints on the object state provide robust operation and allow scene information to be used for an interactive sound-field application. Rather than an objective assessment of localization accuracy, the comparative assessment of algorithms was based on field testing, with the key requirements being reliability, testability, and understanding of potential failure modes.
The work is presented as a demonstration of, and suggestion for, the use of lightweight computational auditory scene analysis in a deployed voice conference system.
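The paper itself does not publish its feature set, but the idea of localizing sound from three co-incident directional microphones can be illustrated with a minimal sketch. Assuming three coincident cardioid capsules aimed at 0, 120, and 240 degrees (a hypothetical arrangement, not confirmed by the abstract), a crude per-frame direction-of-arrival estimate can be formed as an energy-weighted centroid over the capsule axes:

```python
import numpy as np

# Hypothetical sketch: short-time direction-of-arrival from three
# coincident cardioid microphones at 0, 120, and 240 degrees.
# The paper's actual raw features are not specified; this
# energy-centroid estimator is an illustrative assumption only.

MIC_ANGLES = np.deg2rad([0.0, 120.0, 240.0])  # capsule look directions

def frame_doa(frames):
    """Estimate azimuth (radians) for one short-time frame.

    frames: array of shape (3, n_samples), one row per microphone.
    A louder source excites the capsule aimed at it most strongly,
    so the energy-weighted vector sum of the look directions points
    roughly toward the source.
    """
    energy = np.sum(frames ** 2, axis=1)        # per-mic frame energy
    x = np.sum(energy * np.cos(MIC_ANGLES))     # weighted cosine component
    y = np.sum(energy * np.sin(MIC_ANGLES))     # weighted sine component
    return np.arctan2(y, x)

# Simulate a source on-axis to mic 0: cardioid gain is 0.5*(1 + cos(angle)).
rng = np.random.default_rng(0)
source = rng.standard_normal(1024)
gains = 0.5 * (1.0 + np.cos(0.0 - MIC_ANGLES))  # [1.0, 0.25, 0.25]
frames = gains[:, None] * source[None, :]
print(frame_doa(frames))  # near 0.0 radians by symmetry
```

In practice such an instantaneous estimate would be noisy in a reverberant room, which is consistent with the paper's emphasis on clustering and heuristics over several time scales rather than raw per-frame localization.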
Authors:
Dickins, Glenn; Gunawan, David; Shi, Dong
Affiliations:
Dolby Laboratories, Sydney, Australia; Dolby Laboratories, Beijing, China
AES Conference:
52nd International Conference: Sound Field Control - Engineering and Perception (September 2013)
Paper Number:
2-2
Publication Date:
September 2, 2013
Subject:
Spatial Field Control Theory and Applications