eTheses Repository

Multimodal intent recognition for natural human-robotic interaction

Rossiter, James (2011)
Ph.D. thesis, University of Birmingham.

PDF (9Mb)

Abstract

The research questions posed for this work were as follows: Can speech recognition and techniques for topic spotting be used to identify spoken intent in unconstrained natural speech? Can gesture recognition systems based on statistical speech recognition techniques be used to bridge the gap between physical movements and recognition of gestural intent? How can speech and gesture be combined to identify the overall communicative intent of a participant with better accuracy than recognisers built for individual modalities?

In order to answer these questions, a corpus collection experiment for Human-Robotic Interaction was designed to record unconstrained natural speech and three-dimensional motion data from 17 different participants. A speech recognition system was built based on the popular Hidden Markov Model Toolkit, and a topic spotting algorithm based on usefulness measures was designed. These were combined to create a speech intent recognition system capable of identifying intent given natural unconstrained speech. A gesture intent recogniser was built using the Hidden Markov Model Toolkit to identify intent directly from 3D motion data. The speech and gesture intent recognition systems were evaluated separately. The outputs from both systems were then combined, and this integrated intent recogniser was shown to perform better than each recogniser separately.

Both linear and non-linear methods of multimodal intent fusion were evaluated, and the same techniques were applied to the output from the individual intent recognisers. In all cases the non-linear combination of intent gave the highest performance for all intent recognition systems. Combination of speech and gestural intent scores gave a maximum classification performance of 76.7% of intents correctly classified, using a two-layer Multi-Layer Perceptron for non-linear fusion with human-transcribed speech input to the speech classifier. When compared to simply picking the highest-scoring single-modality intent, this represents an improvement of 177.9% over gestural intent classification, 67.5% over a speech intent classifier using human-transcribed speech, and 204.4% over a speech intent classifier using automatically recognised speech.
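The difference between picking the highest-scoring single-modality intent and fusing scores across modalities can be illustrated with a minimal sketch. This is not the thesis's implementation: the intent labels, the score values, the function names and the equal weighting are all hypothetical, and only the simplest linear (weighted-sum) fusion is shown, not the Multi-Layer Perceptron used for non-linear fusion.

```python
from typing import Dict

# Hypothetical per-intent scores from each recogniser (illustrative values only).
speech_scores: Dict[str, float] = {"pick_up": 0.50, "put_down": 0.45, "stop": 0.05}
gesture_scores: Dict[str, float] = {"pick_up": 0.15, "put_down": 0.40, "stop": 0.45}


def pick_best_single_modality(speech: Dict[str, float],
                              gesture: Dict[str, float]) -> str:
    """Baseline: return the intent with the highest score in either modality."""
    best_speech = max(speech, key=speech.get)
    best_gesture = max(gesture, key=gesture.get)
    if speech[best_speech] >= gesture[best_gesture]:
        return best_speech
    return best_gesture


def linear_fusion(speech: Dict[str, float],
                  gesture: Dict[str, float],
                  speech_weight: float = 0.5) -> str:
    """Linear fusion: weighted sum of the two modality scores per intent.

    speech_weight is an assumed tuning parameter in [0, 1]; the gesture
    scores receive the remaining weight.
    """
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")
    # Only intents scored by both recognisers can be fused.
    shared_intents = sorted(speech.keys() & gesture.keys())
    fused = {
        intent: speech_weight * speech[intent]
        + (1.0 - speech_weight) * gesture[intent]
        for intent in shared_intents
    }
    return max(fused, key=fused.get)


baseline_intent = pick_best_single_modality(speech_scores, gesture_scores)
fused_intent = linear_fusion(speech_scores, gesture_scores)
print(baseline_intent, fused_intent)
```

With these made-up scores the baseline follows the single most confident recogniser, while the fused decision favours the intent that both modalities support moderately well. A non-linear fusion would replace the weighted sum with a trained classifier over the concatenated scores.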

Type of Work: Ph.D. thesis
Supervisor(s): Russell, Martin
School/Faculty: Colleges (2008 onwards) > College of Engineering & Physical Sciences
Department: School of Electronic, Electrical and Computer Engineering
Subjects: T Technology (General); TS Manufactures; TJ Mechanical engineering and machinery; QA75 Electronic computers. Computer science; QC Physics
Institution: University of Birmingham
ID Code: 1469
This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder.