eTheses Repository

Towards an improved model of dynamics for speech recognition and synthesis

Hu, Hongwei (2012)
Ph.D. thesis, University of Birmingham.

PDF (2289Kb)


This thesis describes the research on the use of non-linear formant trajectories to model speech dynamics under the framework of a multiple-level segmental hidden Markov model (MSHMM). The particular type of intermediate-layer model investigated in this study is based on the 12-dimensional parallel formant synthesiser (PFS) control parameters, which can be directly used to synthesise speech with a formant synthesiser. The non-linear formant trajectories are generated by using the speech parameter generation algorithm proposed by Tokuda and colleagues. The performance of the newly developed non-linear trajectory model of dynamics is tested against the piecewise linear trajectory model in both speech recognition and speech synthesis. In speech synthesis experiments, the 12 PFS control parameters and their time derivatives are used as the feature vectors in the HMM-based text-to-speech system. The human listening test and objective test results show that, despite the low overall quality of the synthetic speech, the non-linear trajectory model of dynamics can significantly improve the intelligibility and naturalness of the synthetic speech. Moreover, the generated non-linear formant trajectories match actual formant trajectories in real human speech fairly well. The \(\char{cmmi10}{0x4e}\)-best list rescoring paradigm is employed for the speech recognition experiments. Both context-independent and context-dependent MSHMMs, based on different formant-to-acoustic mapping schemes, are used to rescore an \(\char{cmmi10}{0x4e}\)-best list. The rescoring results show that the introduction of the non-linear trajectory model of formant dynamics results in statistically significant improvement under certain mapping schemes. In addition, the smoothing in the non-linear formant trajectories has been shown to be able to account for contextual effects such as coarticulation.

Type of Work:Ph.D. thesis.
Supervisor(s):Russell, Martin
School/Faculty:Colleges (2008 onwards) > College of Engineering & Physical Sciences
Department:Department of Electronic, Electrical and Computer Engineering
Subjects:QA75 Electronic computers. Computer science
T Technology (General)
TK Electrical engineering. Electronics Nuclear engineering
Institution:University of Birmingham
ID Code:3704
This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder.
Export Reference As : ASCII + BibTeX + Dublin Core + EndNote + HTML + METS + MODS + OpenURL Object + Reference Manager + Refer + RefWorks
Share this item :
QR Code for this page

Repository Staff Only: item control page