Model-based reinforcement learning with small sample size

Zhang, Liangpeng (2021). Model-based reinforcement learning with small sample size. University of Birmingham. Ph.D.

Text - Accepted Version
Available under License Creative Commons Attribution Non-commercial.



State-of-the-art reinforcement learning (RL) algorithms generally require large amounts of interaction data to learn well, which makes them difficult to apply to problems where data is expensive. This thesis studies exploration, transformation bias, and policy selection in model-based RL on finite MDPs, all of which have a strong impact on sample efficiency.
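To make the setting concrete, model-based RL on a finite MDP typically estimates a transition and reward model from observed samples and then plans on that estimated model. The sketch below (all function names are illustrative, not taken from the thesis) shows the certainty-equivalence approach: maximum-likelihood model building from a small batch of (s, a, r, s') samples, followed by value iteration on the learned model.

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Maximum-likelihood transition/reward estimates from a batch of
    (s, a, r, s') samples (certainty-equivalence model building)."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        reward_sums[s, a] += r
        visits[s, a] += 1
    visits = np.maximum(visits, 1.0)   # unvisited pairs default to zeros
    return counts / visits[:, :, None], reward_sums / visits

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Plan on the estimated model; returns state values and a greedy policy."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V          # Q[s, a] under the learned model
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# A two-state MDP observed from only four samples: the planner treats the
# noisy maximum-likelihood model as if it were the true MDP.
data = [(0, 0, 1.0, 1), (0, 1, 0.0, 0), (1, 0, 0.0, 1), (1, 1, 1.0, 0)]
P, R = estimate_model(data, n_states=2, n_actions=2)
V, policy = value_iteration(P, R)
```

When the sample per state-action pair is this small, the estimated P and R are noisy, so the planned values and policy inherit that error; the three topics the thesis studies all concern this small-sample regime.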

Exploration has previously been studied in a setting where the cumulative reward collected during learning must be maximised. When learning-process cumulative reward is irrelevant and sample efficiency is the primary concern, existing strategies become inefficient and existing analyses become unsuitable. This thesis formulates the planning-for-exploration problem, and shows that the efficiency of exploration strategies can be better analysed by comparing their behaviours and exploration costs with the optimal exploration scheme of this planning problem. The weaknesses of existing strategies, and the advantages of conducting explicit planning for exploration, are demonstrated through an analysis of exploration behaviour in tower MDPs.
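The notion of exploration cost can be made concrete with a small simulation (the chain MDP and all names here are illustrative assumptions, not the thesis's tower MDPs): count how many environment steps a uniformly random strategy needs before every state-action pair has been sampled m times.

```python
import numpy as np

def random_exploration_cost(n_states, m, rng=None, max_steps=100_000):
    """Simulate uniform-random exploration in a left/right chain MDP and
    return the number of steps taken until every state-action pair has
    been visited at least m times (one simple notion of exploration cost)."""
    rng = rng if rng is not None else np.random.default_rng()
    visits = np.zeros((n_states, 2), dtype=int)
    s = 0
    for step in range(1, max_steps + 1):
        a = int(rng.integers(2))          # 0 = move left, 1 = move right
        visits[s, a] += 1
        s = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        if (visits >= m).all():
            return step
    return max_steps
```

An explicitly planned exploration scheme would aim to reach the same visit counts in fewer steps; comparing such costs against the optimal scheme is the kind of analysis the paragraph above describes.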

The transformation bias of value estimates in model-based RL has previously been considered insignificant and has received little attention. This thesis presents a systematic empirical study showing that when the sample size is small, the transformation bias is not only significant but can, in some cases, have a disastrous effect on the accuracy of value estimates and on overall learning performance. The novel Bootstrap-based Transformation Bias Correction method is proposed to reduce the transformation bias without requiring any additional data. It works well even when the sample size per state-action pair is very small, which is not possible with the existing method.
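The thesis's exact Bootstrap-based Transformation Bias Correction procedure is not reproduced here, but the general bootstrap bias-correction idea it builds on can be sketched as follows (the function names and the value-like transform are illustrative assumptions). Because a value estimate is a nonlinear transform of estimated model parameters, the plug-in estimate is biased; resampling the data estimates that bias so it can be subtracted, using no additional samples.

```python
import numpy as np

def bootstrap_bias_corrected(samples, transform, n_boot=2000, rng=None):
    """Generic bootstrap bias correction for a nonlinear transform of
    sample data: bias is estimated as mean(bootstrap replicates) - plug-in
    and subtracted, giving corrected = 2 * plug-in - mean(replicates)."""
    rng = rng if rng is not None else np.random.default_rng()
    samples = np.asarray(samples, dtype=float)
    plug_in = transform(samples)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        reps[b] = transform(rng.choice(samples, size=len(samples), replace=True))
    return 2.0 * plug_in - reps.mean()

# Illustration (hypothetical transform): a value-like quantity that is a
# convex function of an estimated transition probability, so the plug-in
# estimate tends to be biased when only 8 samples are available.
obs = np.array([1, 1, 0, 1, 1, 1, 0, 1], dtype=float)
value = lambda x: 1.0 / (1.0 - 0.9 * x.mean())
corrected = bootstrap_bias_corrected(obs, value, rng=np.random.default_rng(1))
```

For a linear transform the correction leaves the estimate essentially unchanged; the effect only appears for nonlinear transforms, which is exactly where transformation bias arises.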

Policy selection is rarely studied; most model-based algorithms perform it naively by directly comparing two estimated values, which increases the risk of selecting an inferior policy because the distributions of value estimates are asymmetric. To better study the effectiveness of policy selection, two novel family-wise metrics are proposed and analysed in this thesis. The novel Bootstrap-based Policy Voting method is proposed for policy selection, and significantly reduces the risk of policy selection failures. Finally, two novel tournament-based policy refinement methods are proposed, which improve general RL performance without requiring more data.
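The voting idea can be sketched in miniature (a simplified illustration on per-policy return samples, not the thesis's exact model-based procedure; all names are assumptions): instead of trusting one point-estimate comparison, resample the data many times, compare the resampled estimates, and select the policy that wins the majority of replicates.

```python
import numpy as np

def bootstrap_policy_vote(returns_a, returns_b, n_boot=1000, rng=None):
    """Pick between two candidate policies by bootstrap voting: resample
    each policy's observed returns, compare the resampled means, and
    select the policy that wins the majority of replicates, rather than
    comparing a single pair of estimated values."""
    rng = rng if rng is not None else np.random.default_rng()
    a = np.asarray(returns_a, dtype=float)
    b = np.asarray(returns_b, dtype=float)
    wins_a = 0
    for _ in range(n_boot):
        mean_a = rng.choice(a, size=len(a), replace=True).mean()
        mean_b = rng.choice(b, size=len(b), replace=True).mean()
        wins_a += mean_a >= mean_b
    return "A" if wins_a > n_boot // 2 else "B"

# With clearly separated return samples, every replicate votes the same way.
winner = bootstrap_policy_vote([5.0, 6.0, 7.0], [0.0, 1.0, 2.0],
                               rng=np.random.default_rng(0))
```

Voting over replicates makes the decision depend on how often one policy beats the other under resampling, which is more robust to asymmetric estimate distributions than a single comparison of point estimates.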

Type of Work: Thesis (Doctorates > Ph.D.)
Award Type: Doctorates > Ph.D.
Licence: Creative Commons: Attribution-Noncommercial 4.0
College/Faculty: Colleges (2008 onwards) > College of Engineering & Physical Sciences
School or Department: School of Computer Science
Funders: Engineering and Physical Sciences Research Council, Other, Royal Society
Other Funders: China Scholarship Council
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science



