Dual-stream recurrent convolutional neural networks as models of human audiovisual perception

Joannou, Michael (2022). Dual-stream recurrent convolutional neural networks as models of human audiovisual perception. University of Birmingham. Ph.D.

Joannou2022PhD.pdf (Text, 4MB). Available under license: All rights reserved.

Abstract

Multisensory perception allows humans to operate successfully in the world. Increasingly, deep neural networks (DNNs) are used as models of human unisensory perception. In this work, we take some of the first steps to extend this line of research from the unisensory to the multisensory domain, specifically audiovisual perception. First, we produce a highly controlled, large, labelled dataset of audiovisual action events for human-versus-DNN studies. Next, we introduce a novel deep neural network architecture that we name a ‘dual-stream recurrent convolutional neural network’ (DRCNN), consisting of two component CNNs joined by a novel ‘multimodal squeeze unit’ and fed into an RNN. We develop a series of these architectures, each built on pretrained state-of-the-art CNNs, and train multiple instances of each, producing a set of classifiers. We find that, after optimising 12 classifier instances on audiovisual action recognition, all classifiers are able to solve the audiovisual correspondence problem, indicating that this ability may be a consequence of the task constraints. Further, we find that these classifiers are highly affected by signals in the unattended modality during unimodal classification tasks, demonstrating a high level of integration across modalities. Additional experiments revealed that DRCNN classifiers perform significantly worse than humans on a visual-only action recognition task, whether stimuli were clean or distorted by Gaussian noise or Gaussian blur. Both classifiers and humans were able to leverage audio information to improve their performance in the clean condition and to significantly reduce the effect of visual distortion on their audiovisual performance. Indeed, 5/6 classifiers performed within the range of human performance on clean audiovisual stimuli, and 3/6 maintained human-level performance when low levels of Gaussian noise were introduced.
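The abstract names the DRCNN's components (two CNN streams, a multimodal squeeze unit, an RNN) but not their exact wiring. The sketch below is one plausible reading in PyTorch, not the thesis' implementation: the class names, the GRU (the abstract only says "an RNN"), the backbone interfaces, and in particular the squeeze-and-excitation-style gating inside `MultimodalSqueezeUnit` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class MultimodalSqueezeUnit(nn.Module):
    """Hypothetical fusion block (name from the abstract; internals are an
    assumption): concatenates the two streams' per-frame features and gates
    a projection of them through a bottleneck, squeeze-and-excitation style."""

    def __init__(self, vis_dim: int, aud_dim: int, out_dim: int):
        super().__init__()
        self.project = nn.Linear(vis_dim + aud_dim, out_dim)
        self.gate = nn.Sequential(
            nn.Linear(vis_dim + aud_dim, out_dim // 4),
            nn.ReLU(),
            nn.Linear(out_dim // 4, out_dim),
            nn.Sigmoid(),
        )

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([vis, aud], dim=-1)        # (batch, time, vis+aud)
        return self.project(fused) * self.gate(fused)


class DRCNN(nn.Module):
    """Dual-stream recurrent CNN: two CNN backbones (visual frames, audio
    spectrogram slices) -> multimodal squeeze unit -> RNN -> classifier."""

    def __init__(self, vis_cnn: nn.Module, aud_cnn: nn.Module,
                 vis_dim: int, aud_dim: int, fused_dim: int = 512,
                 hidden_dim: int = 256, n_classes: int = 25):
        super().__init__()
        self.vis_cnn = vis_cnn   # e.g. a pretrained image CNN, minus its head
        self.aud_cnn = aud_cnn   # e.g. a CNN over log-mel spectrogram patches
        self.fuse = MultimodalSqueezeUnit(vis_dim, aud_dim, fused_dim)
        self.rnn = nn.GRU(fused_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)  # class count is arbitrary here

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, 3, H, W); audio: (batch, time, 1, F, T) slices
        b, t = video.shape[:2]
        # Flatten batch and time so 2-D CNN backbones process every frame at once
        vis = self.vis_cnn(video.flatten(0, 1)).view(b, t, -1)
        aud = self.aud_cnn(audio.flatten(0, 1)).view(b, t, -1)
        fused = self.fuse(vis, aud)      # per-frame fused features
        _, h = self.rnn(fused)           # h: (num_layers, batch, hidden_dim)
        return self.head(h[-1])          # action-class logits for the clip
```

Flattening batch and time lets ordinary 2-D pretrained backbones produce per-frame embeddings, which the squeeze unit fuses frame by frame before the recurrent layer integrates them over the clip; the thesis may organise these steps differently.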

Type of Work: Thesis (Doctorates > Ph.D.)
Award Type: Doctorates > Ph.D.
Supervisor(s): Noppeney, Uta; Bohnet, Bernd; Rotshtein, Pia (Email and ORCID unspecified)
Licence: All rights reserved
College/Faculty: Colleges (2008 onwards) > College of Life & Environmental Sciences
School or Department: School of Psychology
Funders: Engineering and Physical Sciences Research Council
Subjects: Q Science > Q Science (General)
URI: http://etheses.bham.ac.uk/id/eprint/12940
