3D Pose and shape estimation of hands and manipulated objects from images and videos

Tse, Tze Ho Elden (2024). 3D Pose and shape estimation of hands and manipulated objects from images and videos. University of Birmingham. Ph.D.

Tse2024PhD.pdf (Text, Accepted Version, 30MB). All rights reserved.

Abstract

3D shape and pose estimation of hands and manipulated objects is an important and long-standing problem in computer vision. It is particularly challenging due to extreme variations in object shape and texture, and heavy occlusions can be introduced by other objects in the scene or by humans during interaction. Nevertheless, modelling hand-object manipulation is essential for understanding how humans interact with the physical world. In this thesis, we focus on three outstanding challenges, which we address by unifying geometry-driven and data-driven methods: (1) estimating 3D pose and shape from 2D images is an extremely ill-posed problem due to the loss of depth information in the projection from 3D to 2D; (2) 3D hand reconstructions are often inexpressive and physically implausible; and (3) methods are unable to recognise seen actions performed on unseen objects.

We propose three main contributions to overcome these challenges. First, we present a new collaborative learning strategy in which two branches of a deep neural network mutually exchange information for 3D hand-object reconstruction from a single RGB image. Second, we present a new Transformer-based method that estimates the absolute root pose and shape of two hands with extended forearms at high resolution from egocentric RGB images. Third, we present a new method for compositional action recognition that leverages 3D geometric information from egocentric RGB videos; specifically, we exploit superquadrics for both template-free object reconstruction and interaction recognition.

This thesis pushes the state of the art in understanding hands and objects from RGB images and videos. First, we show that a collaborative learning framework that iteratively shares 3D geometric information across two network branches can tackle the problem of mutual occlusion; through this novel network architecture design, we achieve state-of-the-art performance on several common public benchmarks. Second, we present the first method that reconstructs high-fidelity two-hand meshes with extended forearms from multi-view RGB images, and demonstrate that leveraging the properties of the graph Laplacian from spectral graph theory effectively aggregates multi-view features while producing smooth meshes. Third, we explore superquadrics as an alternative 3D object representation to bounding boxes and demonstrate that they are beneficial for recognising seen actions performed on unseen objects.
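The first contribution describes two network branches that mutually exchange information. As a rough illustration of that idea only (this is not the thesis architecture; the module and parameter names below are hypothetical), a fusion stage in PyTorch could let a hand branch and an object branch condition on each other's features:

    import torch
    import torch.nn as nn

    class CollaborativeBlock(nn.Module):
        """One stage of a two-branch network in which a hand branch and an
        object branch exchange features, so each branch can account for
        the part of the scene the other one explains. Illustrative sketch,
        not the thesis architecture."""

        def __init__(self, dim: int = 256):
            super().__init__()
            self.hand_layer = nn.Linear(dim, dim)
            self.obj_layer = nn.Linear(dim, dim)
            # Fusion layers mixing each branch's features with the other's.
            self.hand_fuse = nn.Linear(2 * dim, dim)
            self.obj_fuse = nn.Linear(2 * dim, dim)

        def forward(self, f_hand: torch.Tensor, f_obj: torch.Tensor):
            h = torch.relu(self.hand_layer(f_hand))
            o = torch.relu(self.obj_layer(f_obj))
            # Mutual exchange: each branch conditions on the other branch.
            f_hand = torch.relu(self.hand_fuse(torch.cat([h, o], dim=-1)))
            f_obj = torch.relu(self.obj_fuse(torch.cat([o, h], dim=-1)))
            return f_hand, f_obj

Stacking several such blocks is one way to realise the iterative sharing of information the abstract describes, e.g. `blocks = nn.ModuleList(CollaborativeBlock() for _ in range(3))`.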
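The second result attributes multi-view feature aggregation and mesh smoothness to properties of the graph Laplacian. As background for that claim, here is a minimal NumPy sketch of uniform-weight Laplacian smoothing on a triangle mesh, a standard technique rather than the thesis's specific spectral formulation; function and variable names are illustrative:

    import numpy as np

    def laplacian_smooth(vertices, faces, lam=0.5, iterations=10):
        """Uniform-weight Laplacian smoothing of a triangle mesh.

        vertices: (V, 3) float array, faces: (F, 3) int array.
        Each iteration moves every vertex a fraction `lam` of the way
        towards the average of its one-ring neighbours, damping the
        high graph-frequency components that appear as surface noise.
        """
        V = len(vertices)
        # Build the vertex adjacency matrix from the face list.
        A = np.zeros((V, V), dtype=np.float64)
        for i, j, k in faces:
            A[i, j] = A[j, i] = 1.0
            A[j, k] = A[k, j] = 1.0
            A[i, k] = A[k, i] = 1.0
        deg = A.sum(axis=1, keepdims=True)
        W = A / np.maximum(deg, 1.0)  # row-stochastic neighbour averaging
        verts = vertices.astype(np.float64).copy()
        for _ in range(iterations):
            # x <- x + lam * (W - I) x, i.e. a step along the graph Laplacian.
            verts += lam * (W @ verts - verts)
        return verts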
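The third contribution represents objects with superquadrics. A superquadric surface is commonly defined by the inside-outside function of Barr (1981); the sketch below evaluates it for a batch of points (the function and argument names are illustrative, not taken from the thesis):

    import numpy as np

    def superquadric_inside_outside(points, scale, eps1, eps2):
        """Inside-outside function F of a superquadric (Barr, 1981).

        F < 1: point inside the surface; F == 1: on the surface;
        F > 1: outside. `points` is an (N, 3) array in the superquadric's
        canonical frame, `scale` = (a1, a2, a3) are the axis lengths, and
        eps1/eps2 control the shape: both 1.0 gives an ellipsoid, values
        near 0.1 approach a box.
        """
        x, y, z = np.abs(points / np.asarray(scale, dtype=np.float64)).T
        xy = (x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
        return xy + z ** (2.0 / eps1)

Because the whole shape is captured by a handful of parameters (scale plus two exponents, along with a rigid pose), such a representation carries far more geometric information about an unseen object than a bounding box does.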

Type of Work: Thesis (Doctorates > Ph.D.)
Award Type: Doctorates > Ph.D.
Supervisor(s): Leonardis, Ales; Chang, Hyung Jin
Licence: All rights reserved
College/Faculty: Colleges > College of Engineering & Physical Sciences
School or Department: School of Computer Science
Funders: None/not applicable
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
URI: http://etheses.bham.ac.uk/id/eprint/15153
