Hanani, Abualseoud (2012). Ph.D. thesis, University of Birmingham.
The paralinguistic information in a speech signal includes clues to the geographical and social background of the speaker. This thesis is concerned with the automatic extraction of this information from a short segment of speech. A state-of-the-art Language Identification (ID) system, obtained by fusing variants of Gaussian mixture model (GMM) and support vector machine (SVM) classifiers, is developed and evaluated on the NIST 2003 and 2005 Language Recognition Evaluation (LRE) tasks. This system is applied to the problems of regional accent recognition for British English, and of ethnic group recognition within a particular accent. We compare the results with human performance and, for accent recognition, with the 'text-dependent' ACCDIST accent recognition measure. For the fourteen regional accents of British English in the ABI-1 corpus (good-quality read speech), our language ID system achieves a recognition accuracy of 86.4%, compared with 95.18% for our best ACCDIST-based system and 58.24% for human listeners. The "Voices across Birmingham" corpus contains significant amounts of conversational telephone speech for the two largest ethnic groups in the city of Birmingham (UK), namely the 'Asian' and 'White' communities. Our language ID system distinguishes between these two groups with an accuracy of 94.3%, compared with 90.24% for human listeners.
Although direct comparison is difficult, our language ID system appears to perform much better on the standard twelve-class NIST 2003 Language Recognition Evaluation task and on the two-class ethnic group recognition task than on the fourteen-class regional accent recognition task. We conclude that automatic accent recognition is a challenging task for speech technology, and that the use of natural conversational speech may be advantageous for these types of paralinguistic task.
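The GMM side of such a system can be sketched as follows: one GMM is trained per class on per-frame acoustic features, and a test utterance is assigned to the class whose model gives the highest average frame log-likelihood. The sketch below is a minimal illustration using synthetic data and scikit-learn's GaussianMixture; the class names, feature dimensions, and component count are invented for the example and are far smaller than the thesis's fused GMM-SVM system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-frame acoustic features (e.g. 13-dim MFCCs);
# the thesis trains on real speech with much larger GMMs.
classes = {
    "accent_A": rng.normal(loc=0.0, scale=1.0, size=(500, 13)),
    "accent_B": rng.normal(loc=3.0, scale=1.0, size=(500, 13)),
}

# One GMM per class, trained on that class's frames.
models = {
    name: GaussianMixture(n_components=4, random_state=0).fit(frames)
    for name, frames in classes.items()
}

def classify(frames):
    """Return the class whose GMM gives the highest average frame log-likelihood."""
    scores = {name: gmm.score(frames) for name, gmm in models.items()}
    return max(scores, key=scores.get)

# An utterance drawn from accent_B's statistics is classified accordingly.
test_utterance = rng.normal(loc=3.0, scale=1.0, size=(100, 13))
print(classify(test_utterance))  # -> accent_B
```

In a full system the per-class scores would be fused with SVM outputs rather than compared directly, but the maximum-likelihood decision rule above is the core of the GMM classifier.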
One issue with conventional approaches to language ID that use high-order Gaussian Mixture Models (GMMs) and high-dimensional feature vectors is the amount of computing power that they require. Currently, multi-core Graphics Processing Units (GPUs) provide a possible solution at very little cost. In this thesis we also explore the application of GPUs to speech signal and pattern processing, using language ID as a vehicle to demonstrate their benefits. Realisation of the full potential of GPUs requires both effective coding of predetermined algorithms and, in cases where there is a choice, selection of the algorithm or technique for a specific function that is best able to exploit the properties of the GPU. We demonstrate these principles using the NIST LRE 2003 task, which involves processing over 600 hours of speech. We focus on two parts of the system: the acoustic classifier, which is based on a 2048-component GMM, and the acoustic feature extraction process. For the latter we compare a conventional FFT-based analysis with an FIR filter bank, both in terms of their ability to exploit the GPU architecture and in terms of language ID performance. With no increase in error rate, our GPU-based system with an FIR-based front-end completes the full NIST LRE 2003 task in 16 hours, compared with 180 hours for the more conventional FFT-based system on a standard CPU (a speed-up factor of more than 11).
This unpublished thesis/dissertation is copyright of the author and/or third parties.
The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged.
Further distribution or reproduction in any format is prohibited without the permission of the copyright holder.