Deep learning methods for multi-level surgical video scene understanding

Hao, Luoying (2025). Deep learning methods for multi-level surgical video scene understanding. University of Birmingham. Ph.D.

Full text: Hao2025PhD.pdf (Accepted Version, 29 MB). Available under license: All rights reserved.

Abstract

Medical video analysis spans a wide range of applications, offering valuable insights into patient care and clinical decision-making. Among these, surgical videos have emerged as an especially rich source of data, providing unique opportunities to analyze and optimize complex clinical workflows with high precision and contextual depth. This trend is particularly evident with the rise of Minimally Invasive Surgery (MIS). While MIS has transformed the operating room into a data-rich environment, it has also increased the complexity of surgical workflows and the cognitive workload on surgeons. Consequently, there is a growing need to optimize these workflows and reduce the burden on surgeons via intelligent systems that provide clinical decision support, known as Context-Aware Systems (CAS). A core capability of such systems is automatically recognizing the current state of the surgery. This is effectively accomplished by modeling workflows as a hierarchy of activities defined at different levels of detail, such as phases, steps, and actions. Despite extensive research in surgical activity recognition, most efforts have focused on coarse-grained phase recognition. A comprehensive, multi-level understanding remains a significant challenge, not only due to data scarcity but also due to key methodological limitations. Current models often fail to explicitly model the hierarchical context between activities, are susceptible to learning spurious correlations from visual biases, and lack the robustness to generalize across different clinical environments. Addressing these challenges is crucial for advancing the capabilities of CAS.

This thesis aims to develop advanced deep learning models for multi-level medical video scene understanding by tackling the aforementioned limitations. First, to address the lack of suitable data, we construct a comprehensive cataract surgery dataset with synchronized, multi-level annotations. Next, we develop a novel detection network to improve the accuracy of fine-grained action recognition and, for the first time, to provide confidence scores for clinical trust. Subsequently, we propose a hierarchical framework that explicitly models the contextual relationships between activity levels. To tackle spurious correlations, we introduce a causality-inspired framework to mitigate confounding biases and enhance model robustness. Finally, we develop a foundation model and validate its generalization capabilities across multiple clinical centers.

Collectively, the findings demonstrate that by explicitly modeling hierarchical context and mitigating confounding biases through causality-inspired frameworks, robust and generalizable multi-level medical video understanding is achievable. The novel dataset and methodologies presented in this thesis provide a foundation for developing next-generation CAS, ultimately enhancing clinical training, safety, and efficiency.

Type of Work: Thesis (Doctorates > Ph.D.)
Award Type: Doctorates > Ph.D.
Supervisor(s):
Duan, Jinming (email: unspecified; ORCID: unspecified)
He, Shan (email: unspecified; ORCID: unspecified)
Licence: All rights reserved
College/Faculty: Colleges > College of Engineering & Physical Sciences
School or Department: School of Computer Science
Funders: None/not applicable
Subjects: T Technology > T Technology (General)
URI: http://etheses.bham.ac.uk/id/eprint/16857
