Semantic impact - a novel approach for domain concept selection in ontology learning

Wan, Jizheng ORCID: 0000-0002-1069-4582 (2022). Semantic impact - a novel approach for domain concept selection in ontology learning. University of Birmingham. Ph.D.

This is the latest version of this item.

Wan2022PhD.pdf
Text - Accepted Version
Available under License All rights reserved.

Abstract

One of the remaining challenges in Ontology Learning (OL) is its heavy dependence on human intervention to decide which of the “learnt” concepts from a training corpus are relevant and/or important to the domain of discourse. Although part of this challenge is deeply rooted in expert knowledge of the application domain, a good relevance/importance measure with which concepts can be semantically judged would be a valuable addition to the OL toolkit. A new measure called “Semantic Impact” (SI) is therefore proposed to bridge explicitly defined formal semantics (in the form of ontologies) and the distributional semantics learnt from a vast amount of data.

SI aims to consistently and objectively quantify the semantic importance of a concept by aggregating two different measures: the informativeness of the concept and its connectivity (or correlation) with other concepts. SI has been evaluated through two experiments.
The first experiment was conducted within the news domain, using 200 BBC News articles about Donald Trump (published between February 2017 and September 2017) to semantically assess the impact of the concepts identified from the corpus. This experiment learnt, for example, that the Date concept is one of the most important concepts in the news domain, even though it is not included in the BBC Core Concept ontology.

The second experiment was conducted within the biological domain, using 2000 documents from PubMed on “Candida” to determine which diseases have a greater semantic impact in the Candida domain. The results are promising. The proposed system identified that the concept most correlated (connected) to Disease_D003645 (Sudden Death) is Disease_D003643 (Death), without any pre-defined knowledge (or symbolic processing of such labels). Furthermore, a semantic analogy was identified between Disease_D008223 (Lymphoma) and Disease_D008228 (Non-Hodgkin Lymphoma), owing to the closeness of the two concepts' SI values.

In addition, we have systematically evaluated the results from various angles and demonstrated that each component of SI produces good and consistent results. At the macro level, the overall SI results show a strong clustering trend. At the micro level, the SI results for both semantically important and non-important concepts are reasonable and reproducible. Moreover, we have compared SI with a contemporary mainstream method to show the advantages of the SI algorithm together with its reproducibility.
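
The abstract describes SI as an aggregation of a concept's informativeness and its connectivity with other concepts, but it does not give the formulas. The sketch below is only an illustration of that general idea under stated assumptions, not the method used in the thesis: informativeness is approximated by inverse document frequency, connectivity by the mean Jaccard overlap of document sets, and the aggregation by a weighted sum of min-max normalised scores. All function names and the `alpha` weight are hypothetical.

```python
import math


def min_max(scores):
    """Rescale a {concept: value} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {c: 0.0 for c in scores}
    return {c: (v - lo) / (hi - lo) for c, v in scores.items()}


def informativeness(docs):
    """Assumed proxy: inverse document frequency of each concept."""
    n_docs = len(docs)
    concepts = set().union(*docs)
    return {c: math.log(n_docs / sum(c in d for d in docs)) for c in concepts}


def connectivity(docs):
    """Assumed proxy: mean Jaccard overlap between the set of documents
    containing a concept and the set containing each other concept."""
    concepts = sorted(set().union(*docs))
    where = {c: {i for i, d in enumerate(docs) if c in d} for c in concepts}
    result = {}
    for c in concepts:
        others = [o for o in concepts if o != c]
        if not others:
            result[c] = 0.0
            continue
        overlaps = [len(where[c] & where[o]) / len(where[c] | where[o]) for o in others]
        result[c] = sum(overlaps) / len(others)
    return result


def aggregate(info, conn, alpha=0.5):
    """Assumed aggregation: weighted sum of the two normalised measures."""
    info_n, conn_n = min_max(info), min_max(conn)
    return {c: alpha * info_n[c] + (1 - alpha) * conn_n[c] for c in info}


# Toy corpus: each document is reduced to the set of concepts found in it.
docs = [{"Person", "Date"}, {"Person", "Date", "Place"}, {"Date"}]
scores = aggregate(informativeness(docs), connectivity(docs))
for concept, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{concept}: {score:.3f}")
```

In this toy setup a concept that appears in every document receives zero informativeness under the IDF proxy, which illustrates why the choice of each component measure matters; the thesis itself should be consulted for the actual definitions.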

Type of Work:

Thesis (Doctorates > Ph.D.)

Award Type:

Doctorates > Ph.D.

Supervisor(s):