Multimodal intent recognition for natural human-robotic interaction

Rossiter, James (2011). Multimodal intent recognition for natural human-robotic interaction. University of Birmingham. Ph.D.




The research questions posed for this work were as follows: Can speech recognition and techniques for topic spotting be used to identify spoken intent in unconstrained natural speech? Can gesture recognition systems based on statistical speech recognition techniques be used to bridge the gap between physical movements and recognition of gestural intent? How can speech and gesture be combined to identify the overall communicative intent of a participant with better accuracy than recognisers built for individual modalities?

In order to answer these questions, a corpus collection experiment for Human-Robotic Interaction was designed to record unconstrained natural speech and three-dimensional motion data from 17 different participants. A speech recognition system was built based on the popular Hidden Markov Model Toolkit, and a topic spotting algorithm based on usefulness measures was designed. These were combined to create a speech intent recognition system capable of identifying intent given natural unconstrained speech. A gesture intent recogniser was built using the Hidden Markov Model Toolkit to identify intent directly from 3D motion data. The speech and gesture intent recognition systems were evaluated separately. The outputs of the two systems were then combined, and this integrated intent recogniser was shown to perform better than either recogniser on its own.

Both linear and non-linear methods of multimodal intent fusion were evaluated, and the same techniques were applied to the output from the individual intent recognisers. In all cases, non-linear combination gave the highest performance. Combining speech and gestural intent scores gave a maximum classification performance of 76.7% of intents correctly classified, using a two-layer Multi-Layer Perceptron for non-linear fusion with human-transcribed speech as input to the speech classifier. Compared with simply picking the highest-scoring single-modality intent, this represents an improvement of 177.9% over gestural intent classification, 67.5% over speech intent classification from human-transcribed speech, and 204.4% over speech intent classification from automatically recognised speech.
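As a rough illustration of what linear fusion of intent scores can look like, the sketch below combines per-intent scores from a speech recogniser and a gesture recogniser by weighted sum and picks the highest-scoring intent. It is a minimal sketch under stated assumptions, not the method used in the thesis: the intent labels, the score values, the `speech_weight` parameter and the function names are all hypothetical, and the best result reported in the abstract came from non-linear fusion with a Multi-Layer Perceptron, which is not shown here.

```python
def fuse_linear(speech_scores, gesture_scores, speech_weight=0.5):
    """Weighted sum of two per-intent score dictionaries.

    speech_weight is a hypothetical tuning parameter in [0, 1];
    the gesture scores receive the remaining weight.
    An intent missing from one modality is scored 0.0 there.
    """
    intents = set(speech_scores) | set(gesture_scores)
    return {
        intent: speech_weight * speech_scores.get(intent, 0.0)
        + (1.0 - speech_weight) * gesture_scores.get(intent, 0.0)
        for intent in intents
    }


def classify(scores):
    """Return the intent label with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical scores for three made-up intents.
speech = {"pick_up": 0.6, "put_down": 0.3, "stop": 0.1}
gesture = {"pick_up": 0.2, "put_down": 0.7, "stop": 0.1}

fused = fuse_linear(speech, gesture, speech_weight=0.4)
decision = classify(fused)
```

In this made-up case the speech scores alone favour `pick_up`, while the fused scores favour `put_down`, which shows how a second modality can change the decision. A non-linear fusion would replace the weighted sum with a trained classifier that takes both score vectors as input.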

Type of Work: Thesis (Doctorates > Ph.D.)
Award Type: Doctorates > Ph.D.
College/Faculty: Colleges (2008 onwards) > College of Engineering & Physical Sciences
School or Department: School of Engineering, Department of Electronic, Electrical and Systems Engineering
Funders: Engineering and Physical Sciences Research Council
Subjects: T Technology > T Technology (General)
T Technology > TS Manufactures
T Technology > TJ Mechanical engineering and machinery
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QC Physics



