
AES Convention Papers Forum

Convolutional Transformer for Neural Speech Coding


In this paper, we propose a Convolutional-Transformer speech codec that uses stacks of convolution and self-attention layers to remove redundant information in the downsampling and upsampling blocks of a U-Net-style encoder-decoder neural codec architecture. We design the Transformers to use channel and temporal attention with any number of attention stages and heads while maintaining causality. This allows the model to account for the characteristics of the input vectors and to flexibly exploit temporal and channel-wise relationships at different scales when encoding the salient information in speech. As a result, the model can reduce the dimensionality of its latent embeddings and improve its quantization efficiency while maintaining quality. Experimental results demonstrate that our approach significantly outperforms convolution-only baselines.
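To make the described block structure concrete, the following is a minimal PyTorch sketch (not the authors' code, which is not part of this abstract) of a causal downsampling block that combines a strided convolution with temporal and channel self-attention. All names here (ConvAttnDownBlock, n_heads, d_chan, the kernel and stride choices) are hypothetical illustrations; the sketch simply assumes that temporal attention is made causal with a past-only mask and that channel attention is applied within each frame so it cannot leak future information.

```python
# Hypothetical sketch of a convolution + attention downsampling block in the
# spirit of the abstract. Names and hyperparameters are illustrative only.
import torch
import torch.nn as nn


class ConvAttnDownBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, n_heads: int = 4,
                 stride: int = 2, d_chan: int = 16):
        super().__init__()
        # Left-padded ("causal") strided conv: output frame t depends only
        # on input samples at or before its position in time.
        k = 2 * stride
        self.pad = nn.ConstantPad1d((k - stride, 0), 0.0)
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=k, stride=stride)
        # Temporal attention over frames, restricted to the past via a mask.
        self.t_attn = nn.MultiheadAttention(out_ch, n_heads, batch_first=True)
        # Channel attention: each channel of a single frame becomes a token,
        # so no information flows across time and causality is preserved.
        self.c_in = nn.Linear(1, d_chan)
        self.c_attn = nn.MultiheadAttention(d_chan, 1, batch_first=True)
        self.c_out = nn.Linear(d_chan, 1)
        self.norm_t = nn.LayerNorm(out_ch)
        self.norm_c = nn.LayerNorm(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_ch, samples) -> returns (batch, out_ch, frames)
        h = self.conv(self.pad(x)).transpose(1, 2)      # (B, T, C)
        B, T, C = h.shape
        # Causal temporal attention: frame t may attend to frames <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                     device=h.device), diagonal=1)
        a, _ = self.t_attn(h, h, h, attn_mask=mask)
        h = self.norm_t(h + a)
        # Per-frame channel attention: tokens are the C channel values.
        c = self.c_in(h.reshape(B * T, C, 1))           # (B*T, C, d_chan)
        a, _ = self.c_attn(c, c, c)
        h = self.norm_c(h + self.c_out(a).reshape(B, T, C))
        return h.transpose(1, 2)

# Example: downsample a 1-channel waveform chunk by a factor of 2.
block = ConvAttnDownBlock(in_ch=1, out_ch=32)
y = block(torch.randn(1, 1, 1024))
print(y.shape)  # torch.Size([1, 32, 512])
```

Restricting channel attention to a single frame is one simple way to satisfy the causality requirement the abstract states; the paper's actual attention staging and scales may differ.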



