Multimodal Motion Detection Using Alternating Diffusion

One of the MediaEval 2019 benchmark challenges was no-audio speech detection based on video and IMU recordings (the human movements that accompany speaking). The challenge provided a large dataset from three “cocktail party” experiments. In this project we approached the challenge with a manifold learning method, Alternating Diffusion maps. Our proposed solution involves two steps: motion detection of a single speaker in order to focus the video, followed by speech detection over the focused video. For motion detection we proposed an algorithm based on heatmaps built from Alternating Diffusion kernels, which evaluate the correlation between the two sensors. We created a simulation in order to determine how to tune the model's hyperparameters. Using these conclusions, we successfully detected motion in the raw data. For speech detection, we used the Diffusion Maps algorithm together with a variety of visualization and preprocessing methods.
Unfortunately, we did not achieve speech detection with the methods described above. We concluded that a proof of concept for speech detection requires a simpler and more controlled dataset, and that a single IMU source is likely not informative enough for speech detection.
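The core of the motion-detection step, combining the two sensors through Alternating Diffusion, can be illustrated with a minimal sketch. This is not the project's code: the synthetic signals, the Gaussian kernel bandwidth `eps`, and the use of the leading non-trivial eigenvector are illustrative assumptions; the idea shown is the standard one of composing the row-normalized diffusion operators of the two modalities so that the shared (common) variable is retained while sensor-specific noise is suppressed.

```python
import numpy as np

def diffusion_operator(x, eps):
    # Pairwise squared distances between samples (rows of x)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / eps)                     # Gaussian affinity kernel
    return k / k.sum(axis=1, keepdims=True)   # row-normalize -> Markov matrix

rng = np.random.default_rng(0)
n = 200
common = np.sin(np.linspace(0, 4 * np.pi, n))        # shared latent motion signal
# Each "sensor" observes the common variable plus its own independent noise channel
video = np.column_stack([common + 0.1 * rng.standard_normal(n),
                         rng.standard_normal(n)])
imu = np.column_stack([common + 0.1 * rng.standard_normal(n),
                       rng.standard_normal(n)])

M1 = diffusion_operator(video, eps=1.0)
M2 = diffusion_operator(imu, eps=1.0)
M = M2 @ M1   # alternating-diffusion operator: diffuses through both sensors

# The leading non-trivial eigenvector gives an embedding driven by the common variable;
# a heatmap of M itself can also be inspected for sensor correlation
eigvals, eigvecs = np.linalg.eig(M)
order = np.argsort(-eigvals.real)
embedding = eigvecs[:, order[1]].real
```

Because both `M1` and `M2` are row-stochastic, their product `M` is as well, so it defines a valid diffusion over the samples; its spectrum is what the heatmap-based analysis would examine.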
