Toward deep SLAM: deep-learning-based visual localization and mapping

Zhao, Cheng ORCID: 0000-0001-8502-3233 (2020). Toward deep SLAM: deep-learning-based visual localization and mapping. University of Birmingham. Ph.D.

Zhao2020PhD.pdf
Text - Accepted Version
Restricted to Repository staff only until 31 December 2099.
Available under License All rights reserved.

Abstract

This thesis addresses the related problems of deep-learning-based visual localization and mapping, focusing on two main research topics: Dense Semantic Mapping and Deep Monocular SLAM. Conventional localization and mapping algorithms, which rely on model-based geometric approaches, suffer from several limitations, including a lack of semantic knowledge, absolute scale drift, the weakness of probabilistic state estimation in non-linear settings, and a limited ability to perform dense 3D mapping. In contrast to these model-based approaches, this thesis reformulates visual localization and mapping as a data-driven learning problem, leveraging the image-interpretation ability of Convolutional Neural Networks and the sequence-modelling ability of Recurrent Neural Networks.

From a scientific perspective, this thesis exploits deep learning methods to extend 3D SLAM capabilities in several ways. Firstly, we combine 3D reconstruction with visual recognition and pixel-wise segmentation of material categories, through transfer learning with a Deep Neural Network. This transfers learned knowledge from general object recognition to specific material recognition, and from an image-wise classification task to a pixel-wise segmentation task. Secondly, we propose a Pixel-Voxel Network for scene understanding, which combines the advantages of two modalities: 2D RGB images and 3D point cloud data. The proposed Softmax fusion adaptively learns the probabilistic confidence of each modality. Both methods integrate high-level semantic information with dense 3D mapping. The thesis then presents a Learning Visual Odometry Network that estimates ego-motion through dense 3D flow, which significantly mitigates the absolute scale drift of monocular visual odometry. A Bivariate Gaussian loss function is employed to learn the correlation between the motion directions. Finally, the thesis proposes a Learned Kalman Network as a discriminative state estimator for global trajectory filtering, which combines the non-linear transform property of data-driven deep neural networks with the probabilistic mechanism of the Kalman Filter. The proposed sparse representation and the LSTM prior provide powerful spatio-temporal constraints for the trajectory filtering. Moreover, Deep Neural Network based depth prediction and refinement is employed to enhance the dense 3D mapping for monocular SLAM.

From an applications perspective, the proposed material segmentation and Pixel-Voxel Network are fully integrated into a Dense Semantic Mapping system, which performs dense 3D mapping while simultaneously recognizing and labelling the material or object category of each point in the 3D map. They provide the semantic knowledge required by intelligent robot applications, especially in complex industrial environments such as the decommissioning of legacy nuclear facilities. The proposed Learned Visual Odometry Network and Learned Kalman Network are fully integrated into a Monocular SLAM system, which performs simultaneous localization and dense mapping in urban environments. It is demonstrated as a monocular visual odometry solution for general on-road driving problems.

From an experimental perspective, widely cited public datasets are used to evaluate the proposed approaches: the MINC and ERL Material datasets for material segmentation, the SUN RGB-D and NYU V2 datasets for semantic segmentation, and the KITTI and Apolloscape datasets for visual odometry. The experimental results show that the proposed methods advance the state of the art in both localization and mapping. The proposed methods also show good generalization and robustness across a variety of circumstances and scenes, rather than specialised adaptation to a specific dataset.
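
As an illustration of the Bivariate Gaussian loss mentioned in the abstract, the following is a minimal sketch of the negative log-likelihood of a two-dimensional target under a bivariate Gaussian with a correlation term. It is not taken from the thesis: the function name, the parameter names, and the choice of which two motion components are paired are assumptions made for illustration only. In a learned setting, a network would typically predict the means, standard deviations, and correlation, and the loss would be averaged over training samples.

```python
import math


def bivariate_gaussian_nll(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Negative log-likelihood of the point (x, y) under a bivariate Gaussian.

    Hypothetical helper for illustration; all names are assumptions.
    mu_x, mu_y       : predicted means of the two motion components
    sigma_x, sigma_y : predicted standard deviations (must be > 0)
    rho              : predicted correlation coefficient (must lie in (-1, 1))
    """
    if sigma_x <= 0 or sigma_y <= 0:
        raise ValueError("standard deviations must be positive")
    if not -1.0 < rho < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")

    # Standardized residuals for each component.
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y

    one_minus_rho2 = 1.0 - rho * rho

    # Quadratic (Mahalanobis-style) term of the exponent.
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * one_minus_rho2)

    # Log of the normalization constant 2*pi*sigma_x*sigma_y*sqrt(1 - rho^2).
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y) + 0.5 * math.log(one_minus_rho2)

    return quad + log_norm
```

With `rho = 0` the expression reduces to the sum of two independent univariate Gaussian losses; a non-zero `rho` lowers the loss when the two residuals co-vary in the predicted direction, which is how such a loss can capture correlation between motion components.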

Type of Work:

Thesis (Doctorates > Ph.D.)

Award Type:

Doctorates > Ph.D.

Supervisor(s):