Dual-stream recurrent convolutional neural networks as models of human audiovisual perception

Joannou, Michael (2022). Dual-stream recurrent convolutional neural networks as models of human audiovisual perception. University of Birmingham. Ph.D.

Joannou2022PhD.pdf (Text, 4MB). Available under license: All rights reserved.

Abstract

Multisensory perception allows humans to operate successfully in the world. Increasingly, deep neural networks (DNNs) are used as models of human unisensory perception. In this work, we take some of the first steps to extend this line of research from the unisensory to the multisensory domain, specifically audiovisual perception. First, we produce a highly controlled, large, labelled dataset of audiovisual action events for human-versus-DNN studies. Next, we introduce a novel deep neural network architecture that we name a ‘dual-stream recurrent convolutional neural network’ (DRCNN), consisting of two component CNNs joined by a novel ‘multimodal squeeze unit’ and fed into an RNN. We develop a series of these architectures, each built on pretrained state-of-the-art CNNs, and train multiple instances of each, producing a set of classifiers. We find that, after optimising 12 classifier instances on audiovisual action recognition, all classifiers are able to solve the audiovisual correspondence problem, indicating that this ability may be a consequence of the task constraints. Further, we find that these classifiers are highly affected by signals in the unattended modality during unimodal classification tasks, demonstrating a high level of integration across modalities. Additional experiments revealed that DRCNN classifiers perform significantly worse than humans on a visual-only action recognition task, whether stimuli were clean or distorted by Gaussian noise or Gaussian blur. Both classifiers and humans were able to leverage audio information to improve their performance in the clean condition and to significantly reduce the effect of visual distortion on their audiovisual performance. Indeed, 5/6 classifiers performed within the range of human performance on clean audiovisual stimuli, and 3/6 maintained human-level performance when low levels of Gaussian noise were introduced.
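The abstract names the DRCNN's components (two CNN streams, a multimodal squeeze unit, an RNN) but not their exact wiring. The sketch below is one plausible reading in PyTorch, not the thesis' implementation: the class names, the GRU (the abstract only says "an RNN"), the backbone interfaces, and in particular the squeeze-and-excitation-style gating inside `MultimodalSqueezeUnit` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class MultimodalSqueezeUnit(nn.Module):
    """Hypothetical fusion block (name from the abstract; internals are an
    assumption): concatenates the two streams' per-frame features and gates
    a projection of them through a bottleneck, squeeze-and-excitation style."""

    def __init__(self, vis_dim: int, aud_dim: int, out_dim: int):
        super().__init__()
        self.project = nn.Linear(vis_dim + aud_dim, out_dim)
        self.gate = nn.Sequential(
            nn.Linear(vis_dim + aud_dim, out_dim // 4),
            nn.ReLU(),
            nn.Linear(out_dim // 4, out_dim),
            nn.Sigmoid(),
        )

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([vis, aud], dim=-1)        # (batch, time, vis+aud)
        return self.project(fused) * self.gate(fused)


class DRCNN(nn.Module):
    """Dual-stream recurrent CNN: two CNN backbones (visual frames, audio
    spectrogram slices) -> multimodal squeeze unit -> RNN -> classifier."""

    def __init__(self, vis_cnn: nn.Module, aud_cnn: nn.Module,
                 vis_dim: int, aud_dim: int, fused_dim: int = 512,
                 hidden_dim: int = 256, n_classes: int = 25):
        super().__init__()
        self.vis_cnn = vis_cnn   # e.g. a pretrained image CNN, minus its head
        self.aud_cnn = aud_cnn   # e.g. a CNN over log-mel spectrogram patches
        self.fuse = MultimodalSqueezeUnit(vis_dim, aud_dim, fused_dim)
        self.rnn = nn.GRU(fused_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)  # class count is arbitrary here

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, 3, H, W); audio: (batch, time, 1, F, T) slices
        b, t = video.shape[:2]
        # Flatten batch and time so 2-D CNN backbones process every frame at once
        vis = self.vis_cnn(video.flatten(0, 1)).view(b, t, -1)
        aud = self.aud_cnn(audio.flatten(0, 1)).view(b, t, -1)
        fused = self.fuse(vis, aud)      # per-frame fused features
        _, h = self.rnn(fused)           # h: (num_layers, batch, hidden_dim)
        return self.head(h[-1])          # action-class logits for the clip
```

Flattening batch and time lets ordinary 2-D pretrained backbones produce per-frame embeddings, which the squeeze unit fuses frame by frame before the recurrent layer integrates them over the clip; the thesis may organise these steps differently.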

Type of Work: Thesis (Doctorates > Ph.D.)
Award Type: Doctorates > Ph.D.
Supervisor(s): Noppeney, Uta; Bohnet, Bernd; Rotshtein, Pia (Email and ORCID unspecified)
Licence: All rights reserved
College/Faculty: Colleges (2008 onwards) > College of Life & Environmental Sciences
School or Department: School of Psychology
Funders: Engineering and Physical Sciences Research Council
Subjects: Q Science > Q Science (General)
URI: http://etheses.bham.ac.uk/id/eprint/12940
