Alharbi, Alaa (2023). Crisis detection from Arabic social media. University of Birmingham. Ph.D.
Alharbi2023PhD.pdf (Text, Accepted Version, 5MB). Available under License: All rights reserved.
Abstract
Social media (SM) streams such as Twitter provide large quantities of real-time information about emergency events from which valuable information can be extracted to enhance situational awareness and support humanitarian response efforts. The timely extraction of crisis-related SM messages is challenging as it involves processing large quantities of noisy data in real time. Supervised machine learning classifiers are challenged by out-of-distribution learning when classifying unseen (new) crises due to data variations across events. Moreover, it is impractical to label training data from each novel and emerging crisis since obtaining sufficient labelled data is time-consuming and labour-intensive. This thesis addresses the problem of Twitter crisis classification using supervised learning methods to identify crisis-related data and categorise them into different information types in the multi-source (training data from multiple events) setting. Due to Twitter’s ubiquity during emergency events in the Arab world, the current research focuses on Arabic Twitter content. We have created and published a large-scale Arabic Twitter corpus of crisis events. The corpus has been analysed and manually labelled. Analysing the content includes investigating the main information categories of conversations posted during a range of crisis events using natural language processing techniques. Building these resources is considered one of this thesis’s contributions. The thesis also investigates the generalisation performance of different supervised classical machine learning and deep learning approaches trained on out-of-crisis data to classify unseen crises. We find that deep neural networks such as LSTM and CNN outperform classical machine learning classifiers such as support vector machines and decision trees.
We also evaluate different architectures of deep neural networks and several pre-trained text representations (embeddings) learnt from vast amounts of unlabelled text. Results show that BERT-based models are more robust to out-of-distribution target events and remarkably outperform other models on the information classification task. Experiments show that the performance of BERT-based classifiers can be enhanced when training on similar data. Thus, the last contribution of the present study is to propose an instance distance-based data selection approach for adaptation to improve classifiers’ performance under a domain shift. Using the BERT embeddings, the method selects a subset of multi-event training data that is most similar to the target event. Results show that fine-tuning a BERT model on a selected subset of data to classify crisis tweets outperforms a model that has been fine-tuned on all available source data.
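The data selection idea described at the end of the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not state the distance measure or the selection rule, so this sketch assumes cosine similarity to the centroid of the target-event embeddings and a fixed budget `k`. The function names (`cosine_similarity`, `centroid`, `select_similar_instances`) are hypothetical, and the embeddings are assumed to be precomputed vectors (for example, sentence-level BERT embeddings) passed in as plain lists of floats.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_similar_instances(source_embeddings, target_embeddings, k):
    """Return indices of the k source instances most similar to the target event.

    Assumption (not from the thesis): similarity is the cosine similarity
    between each source embedding and the centroid of the target embeddings.
    """
    target_centre = centroid(target_embeddings)
    scored = [
        (cosine_similarity(vector, target_centre), index)
        for index, vector in enumerate(source_embeddings)
    ]
    # Highest similarity first; ties are broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional stand-ins for real embeddings.
    source = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    target = [[1.0, 0.1], [1.0, -0.1]]
    print(select_similar_instances(source, target, k=2))
```

In a pipeline of this kind, the returned indices would pick the subset of multi-event training tweets used for fine-tuning the classifier; how large that subset should be is left open here.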
Type of Work: | Thesis (Doctorates > Ph.D.)
---|---
Award Type: | Doctorates > Ph.D.
Supervisor(s): |
Licence: | All rights reserved
College/Faculty: | Colleges (2008 onwards) > College of Engineering & Physical Sciences
School or Department: | School of Computer Science
Funders: | Other
Other Funders: | Taibah University in Medina
Subjects: | T Technology > T Technology (General)
URI: | http://etheses.bham.ac.uk/id/eprint/14393