Machine Learning Applications in Speech Processing

Recent advances in machine learning, signal and information processing have brought about new opportunities to solve several long-standing technological challenges in acoustic speech processing. This symposium focuses on the use of such advances to improve the state of the art in hands-free, human-to-human and human-to-machine speech applications. It is well known that speech signals captured by distant microphones in real-world environments are generally corrupted by reverberation and interference from competing audio sources. These acoustic distortions pose a significant hindrance to the wider adoption of voice-driven applications. The objective of this symposium is to foster new signal and information processing trends in the field of speech by creating a platform that brings together researchers and practitioners from academia and industry to propose and exchange ideas and findings on this topic.

We invite papers describing various aspects of machine learning, signal and information processing for both speech enhancement and recognition. Topics of interest in this symposium include:

Keynote Speakers

A Deep Learning Approach to Acoustic Signal Processing

Chin-Hui Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology

Wednesday, 9:15–10:15. Conference 7

Abstract: In contrast to conventional model-based acoustic signal processing, we formulate a given acoustic signal processing problem in a novel deep learning framework as finding a mapping function between the observed signal and the desired targets. Monte Carlo techniques are often required to generate a large collection of signal pairs in order to learn the often-complicated structure of the mapping functions. In the case of speech enhancement, to be able to handle a wide range of additive noises in real-world situations, a large training set, encompassing many possible combinations of speech and noise types, is first designed. Next, deep neural network (DNN) architectures are employed as nonlinear regression functions to ensure a powerful approximation capability. In the case of source separation, a similar simulation methodology can also be adopted. In the case of speech bandwidth expansion, the target wideband signals can be filtered and down-sampled to create the needed narrowband training examples. Finally, in the case of acoustic de-reverberation, a wide variety of simulated room impulse responses are needed to generate a good training set.

When reconstructing the desired target signals, some additional techniques may be required. For example, noisy or missing phase information may need to be estimated in order to enhance the quality of the synthesized signals. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over conventional techniques in speech enhancement, speech source separation and bandwidth expansion. It is also interesting to observe that the proposed DNN approach can serve as an acoustic preprocessing front-end for robust speech recognition to improve performance with or without post-processing.
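The training-pair simulation described in the abstract (additive noise, down-sampling for bandwidth expansion, and convolution with room impulse responses) can be illustrated with a minimal NumPy sketch. This is not the speaker's implementation: the function names `mix_at_snr`, `make_narrowband` and `reverberate`, the filter length, and the decimation factor are all hypothetical choices made for illustration.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so the mixture has the requested SNR (dB).

    Assumes `clean` and `noise` are 1-D float arrays of equal length.
    """
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain g so that clean_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise


def make_narrowband(wideband, factor=2, taps=63):
    """Low-pass filter and decimate a wideband signal (illustrative settings).

    Uses a Hamming-windowed sinc filter with cutoff at the new Nyquist rate.
    """
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / factor) / factor * np.hamming(taps)
    h /= h.sum()  # normalise to unit gain at DC
    filtered = np.convolve(wideband, h, mode="same")
    return filtered[::factor]


def reverberate(clean, rir):
    """Convolve clean speech with a room impulse response, keeping input length."""
    return np.convolve(clean, rir)[: len(clean)]
```

Each function yields the degraded input of one (input, target) pair, with the unmodified signal as the regression target. Calling them repeatedly over randomly drawn noise segments, SNR values and impulse responses would produce the kind of large simulated training collection the abstract refers to.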

Biography: Chin-Hui Lee is a professor at the School of Electrical and Computer Engineering, Georgia Institute of Technology. Before joining academia in 2001, he had 20 years of industrial experience ending in Bell Laboratories, Murray Hill, New Jersey, as a Distinguished Member of Technical Staff and Director of the Dialogue Systems Research Department. Dr. Lee is a Fellow of the IEEE and a Fellow of ISCA. He has published over 400 papers and 30 patents, and is highly cited for his original contributions, with an h-index of 66. He has received numerous awards, including the Bell Labs President's Gold Award in 1998. He won the SPS's 2006 Technical Achievement Award for “Exceptional Contributions to the Field of Automatic Speech Recognition”. In 2012 he was invited by ICASSP to give a plenary talk on the future of speech recognition. In the same year he was awarded the ISCA Medal for Scientific Achievement for “pioneering and seminal contributions to the principles and practice of automatic speech and speaker recognition”.

Learning representations of speech in neural network acoustic models

Steve Renals, University of Edinburgh

Wednesday, 10:15–11:15. Conference 7

Abstract: Deep neural networks have made a significant impact on acoustic modelling and language modelling for speech recognition, in part due to their ability to learn suitable representations for speech recognition. In this talk I shall discuss our recent work on neural network acoustic modelling, focusing on three areas: (1) how neural networks can learn suitable representations for distant speech recognition based on multichannel input; (2) compact model-based speaker adaptation for neural network acoustic models, which can operate in either supervised or unsupervised fashion; (3) automatic domain adaptation of neural network acoustic models.

Biography: Steve Renals is Professor of Speech Technology at the University of Edinburgh. He has research interests in speech and language technology and over 200 publications in the area, including recent work on neural network acoustic models, cross-lingual speech recognition, and meeting recognition. He leads the EPSRC-funded Natural Speech Technology programme in the UK, is senior area editor of IEEE Transactions on Audio, Speech, and Language Processing, and is a member of the ISCA Advisory Council. He is a fellow of the IEEE, and was previously co-editor-in-chief of the ACM Transactions on Speech and Language Processing and an associate editor of IEEE Signal Processing Letters.



For all inquiries and questions please contact Dr. Mehrez Souden at