Designing Provably Convergent Algorithms from the Geometry of Data
Abstract: Machine learning is a powerful modelling tool, garnering massive popularity in both academic research and industrial applications due to its ease of use and seemingly general-purpose capabilities. However, fundamental research into its theoretical properties has lagged behind. While research into why machine learning works is currently limited, research into how we can make machine learning work the way we want has been fruitful. Drawing on functional analysis and the layer-wise construction of neural networks, these networks can be designed to have useful properties such as convexity, non-expansiveness, monotonicity and invertibility. In this dissertation, we develop algorithms in the context of optimisation and sampling, leveraging the expressive strengths of machine learning while maintaining theoretical interpretations.
We first take computational imaging, and more generally linear inverse problems, as an example application of optimisation, and consider variational regularisation. This is a classical mathematical model for image reconstruction, which has seen use in the machine learning regime as a ``more interpretable'' technique. We target three different components of the regulariser-based reconstruction pipeline. (i) We consider inference speed when using a predefined regulariser, and more generally, accelerating large-scale convex optimisation. Incorporating machine learning into the mirror descent algorithm, we significantly accelerate convex optimisation in the presence of similar data, where each datum is an optimisation objective function. (ii) We then consider the setting where the regularisation term is instead implicitly defined through its proximal map, given by a denoiser. In this setting, we prove that it is additionally possible to define a second-order scheme with superlinear convergence to critical points. (iii) We finally consider explicitly learning a data-driven regulariser in the absence of ground-truth data, which presents a significant challenge compared to typical supervised machine learning. Utilising a particular stochastic gradient approximator, we demonstrate that the approximate gradients can also be used to train small neural networks, with only a small performance loss compared to training in the presence of ground-truth data. These algorithms are derived using methods from (i) convex analysis and optimisation, (ii) monotone operator theory and numerical analysis, and (iii) stochastic optimisation and Markov chain theory.
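For context, a standard formulation of variational regularisation for a linear inverse problem, together with the mirror descent iteration referenced in (i), reads as follows; the notation ($A$, $y$, $R$, $\lambda$, the mirror potential $M_\theta$, and step sizes $t_k$) is illustrative background rather than the dissertation's exact formulation, and the specific data fidelities, regularisers, and learned components may differ.
\[
\hat{x} \in \operatorname*{arg\,min}_{x} \; \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda R(x),
\qquad
\nabla M_\theta(x_{k+1}) = \nabla M_\theta(x_k) - t_k \nabla f(x_k),
\]
where $f$ denotes the convex objective being minimised and $M_\theta$ a (possibly learned) strongly convex mirror potential; taking $M_\theta = \tfrac{1}{2}\|\cdot\|_2^2$ recovers standard gradient descent.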
Turning to the stochastic setting, we additionally consider sampling, both from a Gibbs distribution and as a universal data preprocessing method for machine learning. (iv) We derive a principled score approximator using Wasserstein proximals of distributions, bypassing the requirement of choosing an appropriate kernel or learning a neural network approximation. This is then applied to derive a sampling algorithm that exhibits interesting structure, with samples seemingly lying on level-set contours of the underlying distribution. (v) Using the SDE backbone of current diffusion-based generative models, we utilise powerful pre-trained models as a surrogate for a distribution of image data. The diffusion structure is then used to derive a provably convergent dataset distillation technique, which finds a small training set that maximises performance after training. These algorithms are derived using techniques from Monte Carlo methods, Wasserstein metrics, optimal quantisation, and diffusion processes.
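As a point of reference for (iv) and (v), the Gibbs distribution and the overdamped Langevin SDE that targets it can be written as below; this is standard background rather than the specific algorithms developed in the dissertation, and the potential $V$ and score $\nabla \log \pi$ are illustrative notation.
\[
\pi(x) \propto \exp(-V(x)), \qquad \mathrm{d}X_t = -\nabla V(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t,
\]
whose stationary distribution is $\pi$; score-based samplers and diffusion-based generative models replace the typically intractable score $\nabla \log \pi = -\nabla V$ with an approximation, here obtained via Wasserstein proximals or pre-trained diffusion networks.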