eTheses Repository

Fine-grained Arabic named entity recognition

Alotaibi, Fahd Saleh S. (2015)
Ph.D. thesis, University of Birmingham.

Loading
PDF (1759Kb)Accepted Version

Abstract

This thesis addresses the problem of fine-grained NER for Arabic, which poses unique linguistic challenges to NER; such as the absence of capitalisation and short vowels, the complex morphology, and the highly in infection process. Instead of classifying the detected NE phrases into small sets of classes, we target a broader range (i.e. 50 fine-grained classes 'hierarchal-based of two levels') to increase the depth of the semantic knowledge extracted. This has increased the number of classes, complicating the task, when compared with traditional (coarse-grained) NER, because of the increase in the number of semantic classes and the decrease in semantic differences between fine-grained classes.

Our approach to developing fine-grained NER relies on two different supervised Machine Learning (ML) technologies (i.e. Maximum Entropy 'ME' and Conditional Random Fields 'CRF'), which require annotated training data in order to learn by extracting informative features. We develop a methodology which exploit the richness of Arabic Wikipedia (A W) in order to create a scalable fine-grained lexical resource and a corpus automatically. Moreover, two gold-standard created corpora from different genres were also developed to perform comparable evaluation. The thesis also developed a new approach to feature representation by relying on the dependency structure of the sentence to overcome the limitation of traditional window-based (i.e. n-gram) representation. Furthermore, by exploiting the richness of unannotated textual data to extract global informative features using word-level clustering technique was also achieved. Each contribution was evaluated via controlled experiment and reported using three commonly applied metrics, i.e. precision, recall and harmonic F-measure.

Type of Work:Ph.D. thesis.
Supervisor(s):Lee, Mark
School/Faculty:Colleges (2008 onwards) > College of Engineering & Physical Sciences
Department:School of Computer Science
Subjects:PJ Semitic
QA75 Electronic computers. Computer science
QA76 Computer software
Institution:University of Birmingham
ID Code:5970
This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder.
Export Reference As : ASCII + BibTeX + Dublin Core + EndNote + HTML + METS + MODS + OpenURL Object + Reference Manager + Refer + RefWorks
Share this item :
QR Code for this page

Repository Staff Only: item control page