Machine learning applications in drug discovery

Ugurlu, Sadettin Yavuz (2025). Machine learning applications in drug discovery. University of Birmingham. Ph.D.

Ugurlu2025PhD.pdf
Text - Accepted Version
Available under License All rights reserved.

Abstract

Machine learning (ML) applications in drug discovery and development offer significant advantages over traditional methods, such as faster outcomes, improved accuracy, and reduced costs. However, the drive for better performance has often led to the adoption of highly complex models, including deep learning (DL) approaches, which can compromise interpretability. This has highlighted the need for alternative approaches that balance performance and interpretability. Specifically, straightforward yet effective methods are essential for advancing our understanding of key drug discovery and development processes. Such approaches should address performance limitations without sacrificing clarity, particularly in critical areas such as (i) blind docking, (ii) the identification of allosteric binding sites, and (iii) PROteolysis TArgeting Chimaeras (PROTAC) screening. (i) Probing the surface of proteins to predict the binding site and binding affinity for a given small molecule is a critical but challenging task in drug discovery. Blind docking addresses this issue by performing docking on binding regions randomly sampled from the entire protein surface. However, compared with local docking, blind docking is less accurate and reliable because the docking space is too large to be sampled sufficiently. Cavity detection-guided blind docking methods improve accuracy by using cavity detection (also known as binding site detection) tools to guide the docking procedure. However, the performance of these methods relies heavily on the quality of the cavity detection tool, and this dependence on a single cavity detection tool significantly limits their overall performance.
To overcome this limitation, we proposed Consensus Blind Dock (CoBDock), a novel blind, parallel docking method that uses ML algorithms to integrate docking and cavity detection results, improving not only binding site identification but also pose prediction accuracy. Our experiments on several datasets, including PDBBind 2020, ADS, MTi, DUD-E, and CASF-2016, showed that CoBDock achieves better binding site and binding mode performance than other state-of-the-art cavity detection tools and blind docking methods. (ii) Allostery is a crucial mechanism for controlling the actions of proteins. Allosteric modulators can offer many benefits over orthosteric ligands, such as increased selectivity and saturability of their effect. Identifying new allosteric sites presents prospects for creating innovative medications and enhances our understanding of fundamental biological mechanisms. Allosteric sites are increasingly found in different protein families through various techniques, such as ML applications, which opens up possibilities for creating completely novel medications with diverse chemical structures. ML methods such as PASSer exhibit limited efficacy in accurately finding allosteric binding sites when relying solely on 3D structural information. Integrating supporting amino-acid-based information with 3D structural knowledge before conducting feature selection is therefore advantageous for allosteric binding site identification, and can make the resulting models more accurate and robust. Accordingly, we developed a model called Multimodel Ensemble Feature Selection for Allosteric Site Identification (MEF-AlloSite) after collecting 9460 relevant and diverse features from the literature to characterize pockets. The model employs a multimodel feature selection technique suited to the small training set of only 90 proteins to improve predictive performance.
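
The general idea of ensemble feature selection can be illustrated with a minimal sketch. It is not the MEF-AlloSite implementation: the abstract does not name the models or selection rule used, so the two scoring functions (`abs_correlation`, `auc_gap`), the bootstrap resampling, the `min_share` voting threshold, and the toy data are all assumptions chosen for illustration. Several scorers each rank the features on resampled data, and a feature is kept only if it repeatedly lands in the top `k`.

```python
import random
from statistics import mean, pstdev


def abs_correlation(column, labels):
    """Absolute Pearson correlation between one feature column and binary labels."""
    mx, my = mean(column), mean(labels)
    sx, sy = pstdev(column), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((x - mx) * (y - my) for x, y in zip(column, labels))
    return abs(cov / (sx * sy))


def auc_gap(column, labels):
    """Distance of the feature's ROC AUC from chance (0.5), via pairwise comparisons."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return abs(wins / (len(pos) * len(neg)) - 0.5)


def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def ensemble_select(rows, labels, scorers, k, n_rounds=20, min_share=0.6, seed=0):
    """Keep features that reach the top k in at least min_share of all votes.

    One vote is cast per (bootstrap round, scorer) pair. This voting rule is an
    illustrative assumption, not the rule used in the thesis.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    votes = [0] * n_features
    for _ in range(n_rounds):
        # Resample the (small) training set with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_labels = [labels[i] for i in idx]
        columns = [[rows[i][j] for i in idx] for j in range(n_features)]
        for scorer in scorers:
            scores = [scorer(col, sample_labels) for col in columns]
            for j in top_k(scores, k):
                votes[j] += 1
    total = n_rounds * len(scorers)
    return [j for j in range(n_features) if votes[j] / total >= min_share]


# Toy data (hypothetical): feature 0 tracks the label, features 1-5 are noise.
data_rng = random.Random(1)
labels = [i % 2 for i in range(40)]
rows = [[y + data_rng.gauss(0, 0.3)] + [data_rng.gauss(0, 1) for _ in range(5)]
        for y in labels]

selected = ensemble_select(rows, labels, [abs_correlation, auc_gap], k=2)
print(selected)
```

On this toy data the informative feature (index 0) should be among the selected indices, while the noise features are kept only if they happen to score well in most resampling rounds. Requiring agreement across scorers and resamples is one common way to stabilize selection when the training set is small.
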
The feature selection technique increased performance in allosteric binding site identification by selecting promising features from the 9460 candidates. Analyzing the relationship between the selected features and allosteric binding sites also shed light on the complex mechanisms of protein allostery. MEF-AlloSite and state-of-the-art allosteric site identification methods such as PASSer2.0 and PASSerRank were tested on three test cases 51 times, each time with a different split of the training set. Student's t-test and Cohen's d were used to compare the distributions of the average precision and ROC AUC scores. On the three test cases, most p-values were below 0.05 and most Cohen's d values were above 0.5, indicating that MEF-AlloSite's 1-6% higher mean values of average precision and ROC AUC, relative to state-of-the-art allosteric site identification methods, are statistically significant. (iii) Proteolysis-targeting chimaeras (PROTACs), which induce proteolysis by recruiting an E3 ligase to a target protein, are gaining popularity as a novel pharmacological modality because of their unique features, including high potency, low dosage, and effectiveness on undruggable targets. While PROTACs are promising as chemical probes and therapeutic agents, their discovery usually requires synthesizing numerous analogs to explore variations of the chemical linker structure exhaustively. Without extensive trial and error, it is unknown how to link the two protein-recruiting moieties to facilitate the formation of a productive ternary complex. Although molecular docking-based and optimization-based pipelines have been designed to predict ternary complexes and guide rational PROTAC design, they have suffered from limited predictive performance in both the quality of the predicted ternary structures and their ranking. Therefore, MEGA PROTAC has been designed to improve the quality and ranking of ternary structures.
MEGA PROTAC employs MEGADOCK to dock protein-protein complexes (PPCs), which establishes an initial exploration area for PPCs. A sequential filtration strategy combined with rank aggregation is then used to choose a subset of PPCs for grid search. Once candidate PPCs are selected, a grid search is performed separately for translation and rotation. The remaining proteins are grouped into clusters, and MEGA PROTAC further filters these clusters based on the energy scores of the proteins within each cluster. MEGA PROTAC uses rank aggregation to choose the best clusters and then employs MEGADOCK to dock the PROTAC into the selected PPCs, forming a ternary structure. Finally, MEGA PROTAC was tested on 22 experimentally validated structures representing all currently available data. These cases were used to compare MEGA PROTAC with the state-of-the-art method, Bayesian Optimization for Ternary Complex Prediction (BOTCP). MEGA PROTAC outperformed BOTCP on 16 of the 22 test cases, achieving a higher maximum DockQ score with an 18% higher mean and a 35% higher median. MEGA PROTAC also exhibited 75% superior ranks and a reduced cluster number for the maximum DockQ score compared to BOTCP. In addition, MEGA PROTAC achieved a twofold improvement over BOTCP in locating the first acceptable DockQ score, with a larger proportion of near-native structures within the detected cluster.
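
Rank aggregation, as used above to combine several filters into one ordering, can be sketched with a Borda count. This is an illustrative assumption: the abstract does not state which aggregation scheme MEGA PROTAC uses, and the pose identifiers and input rankings below are hypothetical. Each input ranking lists the same candidates from best to worst, and a candidate earns more points the higher it is placed.

```python
def borda_aggregate(rankings):
    """Combine several best-first rankings of the same candidates into one ranking.

    A candidate at position p (0-based) in a ranking of n items earns n - 1 - p
    points; candidates are returned by descending total, ties broken by name.
    """
    candidates = rankings[0]
    n = len(candidates)
    points = {c: 0 for c in candidates}
    for ranking in rankings:
        if set(ranking) != set(candidates) or len(ranking) != n:
            raise ValueError("every ranking must contain the same candidates")
        for position, c in enumerate(ranking):
            points[c] += n - 1 - position
    return sorted(candidates, key=lambda c: (-points[c], c))


# Hypothetical rankings of four candidate complexes under three different scores.
by_score_1 = ["pose_a", "pose_b", "pose_c", "pose_d"]
by_score_2 = ["pose_b", "pose_a", "pose_d", "pose_c"]
by_score_3 = ["pose_b", "pose_c", "pose_a", "pose_d"]

consensus = borda_aggregate([by_score_1, by_score_2, by_score_3])
print(consensus)  # ['pose_b', 'pose_a', 'pose_c', 'pose_d']
```

Here `pose_b` collects 8 points, `pose_a` 6, `pose_c` 3, and `pose_d` 1, so a candidate that is consistently near the top under every score outranks one that is first under only a single score.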

Type of Work:

Thesis (Doctorates > Ph.D.)

Award Type:

Doctorates > Ph.D.

Supervisor(s):