Human attention target estimation and application on images

Horanyi, Nora ORCID: 0000-0002-7135-0222 (2023). Human attention target estimation and application on images. University of Birmingham. Ph.D.

Horanyi2023PhD.pdf
Text - Accepted Version
Available under License All rights reserved.

Abstract

Most computer vision applications, such as automatic image cropping and attention target estimation, aim to perform or solve a task as humans would. While recent work using neural networks has shown promising results in numerous research areas, complex and subjective tasks remain challenging to solve using only information derived from images and videos. Therefore, to enhance the ability of a machine to localise a part of an image, or to interpret complex social interactions between multiple people in a scene as humans would, explicit or implicit user input can be integrated into the algorithm. This thesis investigates the usefulness of explicit verbal and implicit non-verbal human social cues, and their combination, in frameworks designed for attention-based computer vision tasks. The proposed computational methods aim to better understand the user’s intention through different input modalities. Specifically, this work used natural language, alone and combined with eye-tracking input, for description-based image cropping, and visual attention for joint attention target estimation.

This work studied how a user’s natural language expression could be used directly to localise the described part of an image automatically and output an aesthetically pleasing image crop. The proposed solution re-purposed existing deep learning models into a single optimisation framework to solve this complex, highly subjective problem. In addition to the explicit language expressions, a semi-direct social cue, the eye movements of the users, was integrated into a novel multi-modal framework. Finally, motivated by the usefulness of the user’s semi-direct attention input, a deep neural network was developed for estimating attention targets in images, in order to detect and follow the joint attention target of the subjects within the scene.

The presented approaches achieved state-of-the-art performance in quantitative and qualitative measures on different benchmark datasets in their respective research areas. Furthermore, the user studies conducted confirmed that users favoured the output of the proposed solutions. These findings indicate that integrating explicit or implicit user input, or a combination of the two, into computational methods can produce more human-like outputs.

Type of Work: Thesis (Doctorates > Ph.D.)
Award Type: Doctorates > Ph.D.
Supervisor(s):
Leonardis, Ales; Email: UNSPECIFIED; ORCID: orcid.org/0000-0003-0773-3277
Chang, Hyung Jin; Email: UNSPECIFIED; ORCID: orcid.org/0000-0001-7495-9677
Licence: All rights reserved
College/Faculty: Colleges (2008 onwards) > College of Engineering & Physical Sciences
School or Department: School of Computer Science
Funders: Engineering and Physical Sciences Research Council
Subjects: Q Science > Q Science (General)
URI: http://etheses.bham.ac.uk/id/eprint/13746
