Abdulrhman Alanazi, Reem (2023). Optimal 0−1 loss classification in linear, overlapping and interpolation settings. University of Birmingham. Ph.D.
AbdulrhmanAlanazi2023PhD.pdf (5MB), Accepted Version. Restricted to repository staff only until 12 December 2025. Licence: All rights reserved.
Abstract
Classification problems represent a major subject in machine learning, and addressing them requires solving underlying optimisation problems. Optimising the 0–1 loss function is the natural pathway to address the classification problem. Because the 0–1 loss is non-convex and non-differentiable, optimising it is mathematically intractable and classified as non-deterministic polynomial-time hard (NP-hard). Consequently, the 0–1 loss function has been replaced by surrogate loss functions that can be optimised more efficiently and whose optimal solutions are guaranteed with respect to these surrogate losses. At the same time, these functions may not provide the same optimal solution as the 0–1 loss function. Indeed, the loss function used during empirical risk minimisation (ERM) is not the same loss function used in evaluation; this mismatch of loss functions leads to an approximate solution that may not be ideal for 0–1 loss. Thus, the replacement introduces an additional source of error.
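The train/evaluate loss mismatch described above can be illustrated numerically. The toy sketch below (not from the thesis; the margin values are invented for illustration) scores the same set of margins \(m = y f(x)\) with the 0–1 loss and the hinge surrogate: an outlier with a large negative margin barely changes the 0–1 risk but dominates the hinge risk, so a hinge minimiser can be pulled away from the 0–1 minimiser.

```python
import numpy as np

# Toy illustration of the ERM/evaluation loss mismatch: the same
# margins m = y * f(x) are scored by 0-1 loss and by the hinge
# surrogate. The -6.0 outlier contributes the same single error to
# the 0-1 risk as the mild -0.2 mistake, but it dominates the hinge risk.
margins = np.array([2.0, 1.5, 0.3, -0.2, -6.0])   # y * f(x) per example

zero_one = np.mean(margins <= 0)                  # fraction misclassified
hinge = np.mean(np.maximum(0.0, 1.0 - margins))   # surrogate risk

print(zero_one)  # 0.4 : two of five points are misclassified
print(hinge)     # 1.78 : the -6.0 outlier alone contributes 7/5 = 1.4
```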
In this thesis, optimising 0–1 loss directly is discussed from three perspectives. First, the consequences of replacing the 0–1 loss function with surrogate functions are outlined from a theoretical perspective for linearly separable data, showing that hinge loss is an exact minimiser of 0–1 loss in this setting. Second, in the overlapping case, it is determined that the surrogate functions do not provide an exact solution. We therefore compared the actual time complexity of existing 0–1 loss optimisation algorithms with respect to the following second-order factors: the dimensionality of the data and the amount of overlap between classes. In addition, a new heuristic 0–1 loss optimisation algorithm was introduced that can optimise 0–1 risk exactly in a practical way, within the \(O(\sqrt{D}\log(\frac{1}{\epsilon})N^3)\) polynomial complexity class. This new algorithm, called testing all pairs of points (T-PP), was introduced together with a prioritised version, referred to as prioritising the search by support vectors (PS-SVs), whose actual time complexity is lower than that of T-PP. T-PP and PS-SVs can terminate before the time limit, whereas the time taken by other direct 0–1 loss optimisation algorithms depends on these second-order factors when the data size is fixed. Third, a new direction for ERM is discussed in a simple classification algorithm, the canonical K-nearest-neighbour (KNN) algorithm, which is still dominated by conventional ERM. Our new algorithm, the interpolated local KNN, adopts the modern interpolation view of ERM. It aims to fit the training points locally using a set of local K parameters determined by a pre-defined confidence score.
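The pairwise search behind T-PP can be sketched in two dimensions as follows. This is a hedged illustration only: it assumes each opposite-class pair of points proposes a candidate linear boundary (here, the perpendicular bisector of the pair), and the candidate with the lowest empirical 0–1 loss is kept; the thesis's exact candidate construction and the PS-SVs prioritisation are not reproduced.

```python
import numpy as np

def tpp_01_loss(X, y):
    """Hedged 2-D sketch of the pairwise idea behind T-PP: every
    opposite-class pair proposes a linear boundary (the perpendicular
    bisector of the pair), and the boundary with the lowest empirical
    0-1 loss is kept. The thesis's actual construction may differ."""
    best_w, best_b, best_err = None, None, len(y) + 1
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue                      # only opposite-class pairs
            w = X[i] - X[j]                   # normal to the bisector
            b = -w @ (X[i] + X[j]) / 2.0      # bisector passes through the midpoint
            pred = np.sign(X @ w + b)
            # count errors for both orientations of the boundary
            err = min(np.sum(pred != y), np.sum(pred != -y))
            if err < best_err:
                best_w, best_b, best_err = w, b, err
    return best_w, best_b, best_err
```

On linearly separable toy data this exhaustive pairwise scan reaches zero 0–1 loss; the quadratic number of candidates times the linear cost of evaluating each is what makes the approach polynomial rather than NP-hard enumeration.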
The mechanism used in the interpolated local KNN helps to identify noise and overlap points and isolate them from being used in predicting unseen points, while the isolated points retain their own decision boundary so that they can still be predicted correctly. Thus, we can achieve zero empirical risk and lower generalisation risk than the canonical KNN.
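The local-K idea can be sketched as follows. This is an assumption-laden illustration, not the thesis's algorithm: it takes "confidence score" to mean the fraction of a point's K nearest neighbours sharing its label, gives each training point the largest K meeting that threshold, and flags points with no admissible K as noise/overlap points to be isolated.

```python
import numpy as np

def fit_local_k(X, y, k_max=7, conf=1.0):
    """Hedged sketch of the interpolated-local-KNN idea: each training
    point gets its own K, chosen as the largest K <= k_max whose
    neighbourhood agrees with the point's label at confidence >= conf.
    Points with no such K are flagged as noise/overlap (local_k == 0)
    and would be isolated from predicting unseen points. The thesis's
    actual confidence score and search procedure are not specified here."""
    n = len(y)
    # pairwise Euclidean distances between all training points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    local_k = np.zeros(n, dtype=int)          # 0 marks an isolated point
    for i in range(n):
        order = np.argsort(d[i])[1:]          # neighbours, self excluded
        for k in range(min(k_max, n - 1), 0, -1):
            agree = np.mean(y[order[:k]] == y[i])
            if agree >= conf:                 # neighbourhood supports the label
                local_k[i] = k
                break
    return local_k
```

With `conf=1.0` every kept point has a label-pure neighbourhood, so predicting each training point from its own local K reproduces its label exactly, which is the zero-empirical-risk (interpolation) behaviour the abstract describes.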
These contributions show the possibility of optimising the 0–1 loss directly in polynomial time: for linearly separable data, a surrogate loss function suffices; for overlapping data, our new algorithms T-PP and PS-SVs can be used. Ultimately, we introduce the first interpolation version of the canonical KNN algorithm. This thesis focuses on the binary classification problem; our main future focus will be to extend this work to multiclass classification.
Type of Work: Thesis (Doctorates > Ph.D.)
Award Type: Doctorates > Ph.D.
Supervisor(s):
Licence: All rights reserved
College/Faculty: Colleges (2008 onwards) > College of Engineering & Physical Sciences
School or Department: School of Computer Science
Funders: Other
Other Funders: King Faisal University
Subjects: Q Science > QA Mathematics; Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software
URI: http://etheses.bham.ac.uk/id/eprint/14290