Looking to Listen: Audio-Visual Speech Separation with Deep Learning

In recent years, machine learning has been applied increasingly to the speaker separation and speech enhancement problems. In particular, the use of neural networks has become widespread and has led to substantial improvements in the performance of such systems. Nevertheless, speaker separation remains a difficult problem, and audio-only systems offer limited performance. To boost performance, this project examines a multimodal system that combines the audio signal with the video signal, exploiting the strong correlation between the video of the speaker and the clean (expected) audio signal. The two signals are combined and passed through a neural network. The purpose of this project is to examine whether video improves speech enhancement systems and, if so, by how much the quality increases. In addition, the performance of a speech enhancement system was examined for solving the speaker separation problem. Several speech enhancement systems were reviewed; one of them was chosen and tested, and its architecture was integrated into an audio-visual system that was trained and evaluated on audio signals simulating a natural environment. The implemented multimodal system obtained 70% precision with the audio branch alone, whereas the full audio-visual system obtained 75% precision on the speech enhancement task. Adding the video therefore improved the performance of the system.
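To make the fusion idea concrete, the following minimal PyTorch sketch illustrates one plausible way to combine the two modalities; it is an assumption for illustration, not the exact architecture used in this project. Per-frame audio spectrogram features and per-frame visual face embeddings are each encoded separately, concatenated along the feature dimension, and passed through a joint temporal model that predicts a time-frequency mask for the target speaker.

```python
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    """Minimal sketch of an audio-visual fusion network (illustrative assumption).

    Assumed inputs:
      - spec:     log-magnitude spectrogram frames, shape (batch, T, n_freq)
      - face_emb: per-frame face embeddings,        shape (batch, T, n_vis)
    Output:
      - a sigmoid time-frequency mask, shape (batch, T, n_freq),
        to be applied to the noisy spectrogram of the mixture.
    """

    def __init__(self, n_freq=257, n_vis=512, hidden=256):
        super().__init__()
        # Each modality is encoded by its own branch before fusion.
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(n_vis, hidden), nn.ReLU())
        # A joint bidirectional LSTM models the concatenated feature sequence.
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True,
                              bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq),
                                       nn.Sigmoid())

    def forward(self, spec, face_emb):
        a = self.audio_enc(spec)           # (batch, T, hidden)
        v = self.visual_enc(face_emb)      # (batch, T, hidden)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        return self.mask_head(fused)       # (batch, T, n_freq)


if __name__ == "__main__":
    model = AudioVisualFusion()
    spec = torch.randn(2, 100, 257)        # 100 spectrogram frames per example
    face = torch.randn(2, 100, 512)        # matching face embeddings
    mask = model(spec, face)
    print(mask.shape)                      # torch.Size([2, 100, 257])
```

In a setup like this, removing the visual branch (feeding only the audio encoder output to the temporal model) gives the audio-only baseline against which the multimodal system is compared.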
