What is Bayesian Optimization and How is it Used in Machine Learning?

Introduction

Optimization is an inherent human behavior that drives societies to improve their surroundings. Design problems are widespread across various fields, such as engineering, pharmaceuticals, software development, and more. These problems often involve complex and high-dimensional decisions that are difficult to solve. Automating design decisions is crucial for advancing products and innovation across multiple fields.

The Bayesian approach to optimization, which uses statistical models to get insights into the objective function, has been continuously improved since the 1960s. The optimization aim, such as lowering costs or improving performance, is represented by this objective function. Bayesian optimization has found its niche in optimizing objectives that are:

Costly to compute, preventing exhaustive evaluation.
Lacking a useful expression, functioning as “black boxes.”
Not evaluated exactly, but through indirect or noisy mechanisms.
Offering no efficient mechanism for estimating their gradient.

Machine learning algorithms often involve numerous hyperparameters that significantly influence their performance. To use these algorithms effectively, it is crucial to select good hyperparameter values.

Unfortunately, effective settings can usually be identified only through trial and error: training the model with different settings and evaluating each one’s error on a validation set.

Throughout this article, we’ll dive into Bayesian optimization and explore its practical uses, especially for tuning hyperparameters in machine learning. You’ll learn how Bayesian optimization can improve the performance of machine learning models through efficient hyperparameter optimization.

Introduction
What Is Bayesian Optimization?
How Does Bayesian Optimization Work?
Objective Function For Optimization
- Bayesian Optimization for Hyperparameter Tuning
Bayesian Optimization Python – Code Example
Applications Of Bayesian Optimization
Challenges with Bayesian Optimization
Conclusion
References


What Is Bayesian Optimization?

Global optimization is a challenging problem: it requires finding the minimum or maximum of a given objective function. Typically, these objective functions are complex, non-convex, non-linear, and computationally expensive to evaluate. Bayesian optimization provides a principled method, based on Bayes’ theorem, for addressing global optimization problems efficiently.

The formula for Bayes’ theorem is:

P(A | B) = P(B | A) P(A) / P(B)

where:

P(A | B) is the posterior probability of event A occurring, given that event B has occurred.
P(B | A) is the likelihood of event B occurring, given that event A has occurred.
P(A) is the prior probability of event A occurring.
P(B) is the marginal probability of event B occurring.

Bayes’ theorem allows us to update our beliefs (the prior probability) about event A when we have new evidence (event B). The result is the posterior probability, which represents our updated belief about event A after considering the new evidence.
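
As a small numeric illustration of this update, the sketch below computes a posterior from a prior and two likelihoods. The function name and the probabilities are hypothetical and chosen only for illustration; P(B) is obtained from the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A) and P(B | not A)."""
    # Marginal P(B) via the law of total probability
    marginal_b = (likelihood_b_given_a * prior_a
                  + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / marginal_b


# Hypothetical numbers: prior belief 0.3, evidence twice as likely under A
p = posterior(prior_a=0.3, likelihood_b_given_a=0.8, likelihood_b_given_not_a=0.4)
print(round(p, 4))  # 0.4615
```

Because the evidence is more likely under A than under its complement, the belief in A rises from the prior of 0.3 to roughly 0.46.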

Bayesian optimization offers a strong alternative to traditional techniques for fine-tuning hyperparameters, like random search and grid search. While these methods can be computationally demanding, Bayesian optimization intelligently navigates the search space to identify optimal hyperparameters in a more focused approach.

Machine learning models, such as decision trees and deep learning frameworks, can substantially benefit from employing Bayesian optimization in their hyperparameter tuning process.

Bayesian optimization constructs a surrogate model for the objective function, quantifies the uncertainty in that surrogate using a Bayesian machine learning technique called Gaussian process regression, and employs an acquisition function to determine the most promising sampling locations.

The Gaussian process models help in capturing the uncertainty in the surrogate model, making the optimization process more robust.

Because it does not require gradients, a closed-form expression, or exact evaluations of the objective, Bayesian optimization can deliver impressive performance when optimizing complex “black box” objectives with limited observation budgets. Its success spans science, engineering, and beyond, including hyperparameter tuning, with impact in fields like:

Automatic machine learning
Reinforcement learning
Robotics
Environmental monitoring
Information extraction
Combinatorial optimization
Sensor networks

One of the key aspects of Bayesian optimization is the use of acquisition functions, such as the probability of improvement. These functions help balance exploration and exploitation in the optimization process.


How Does Bayesian Optimization Work?

Bayesian optimization is an effective and efficient global optimization method for black-box functions, especially valuable for tuning hyperparameters in machine learning models. It leverages probabilistic models to make intelligent decisions about which points in the search space to sample next, minimizing the number of function evaluations required.

The key concepts in Bayesian optimization are Gaussian processes, covariance functions, the posterior distribution, and acquisition functions. Below is a detailed explanation of the Bayesian optimization process, with key concepts, equations, and steps:

Gaussian Process (GP) Regression: Bayesian optimization uses Gaussian process regression to model the unknown function. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

It is defined by a mean function m(x) and a covariance function k(x, x’), where x and x’ are points in the input space.

Covariance Function: The covariance function, also known as the kernel function, measures the similarity between points in the input space. A commonly used kernel in Bayesian optimization is the Radial Basis Function (RBF) kernel, defined as:

k(x, x’) = exp(-||x - x’||^2 / (2l^2))

where l is the length scale, a hyper-parameter controlling the smoothness of the GP model.
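
The RBF kernel can be sketched in a few lines of NumPy. This is a minimal illustration rather than any library's implementation; the function name `rbf_kernel` and the sample points are assumptions made for the example.

```python
import numpy as np


def rbf_kernel(xa, xb, length_scale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 l^2)) for every pair of rows."""
    xa = np.atleast_2d(xa)
    xb = np.atleast_2d(xb)
    # Squared Euclidean distance between each row of xa and each row of xb
    sq_dist = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))


# Three hypothetical 1-D inputs, one per row
X = np.array([[0.0], [0.5], [2.0]])
K = rbf_kernel(X, X, length_scale=1.0)
print(np.round(K, 3))
```

Nearby points (0.0 and 0.5) receive a covariance close to 1, while distant points (0.0 and 2.0) are nearly uncorrelated. A larger length scale makes the covariance decay more slowly, which yields smoother functions.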

Posterior Distribution: Given a set of observed data points (X, y), we can compute the posterior distribution of the function value f(x*) at a new input x* using the Gaussian process model. The posterior mean and variance are given by:

μ(x*) = k(x*, X) (K(X, X) + σ_n^2 I)^{-1} y

σ^2(x*) = k(x*, x*) - k(x*, X) (K(X, X) + σ_n^2 I)^{-1} k(X, x*)

where k(x*, X) is the vector of covariances between x* and the observed inputs X, K(X, X) is the matrix of pairwise covariances between observed inputs, and σ_n^2 is the observation noise variance (written with a subscript to distinguish it from the posterior variance σ^2(x*)).
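
The posterior equations translate almost directly into NumPy. The sketch below assumes 1-D inputs, a zero prior mean, and an RBF kernel with unit prior variance (so k(x*, x*) = 1). The helper names, the small `noise_var`, and the training data are hypothetical.

```python
import numpy as np


def rbf_kernel(xa, xb, length_scale=1.0):
    # Pairwise RBF covariances for 1-D input arrays
    return np.exp(-((xa[:, None] - xb[None, :]) ** 2) / (2.0 * length_scale ** 2))


def gp_posterior(x_train, y_train, x_new, length_scale=1.0, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at the points x_new."""
    K = rbf_kernel(x_train, x_train, length_scale) + noise_var * np.eye(len(x_train))
    k_star = rbf_kernel(x_new, x_train, length_scale)  # shape (m, n)
    # Solve linear systems instead of forming the inverse explicitly
    mean = k_star @ np.linalg.solve(K, y_train)
    v = np.linalg.solve(K, k_star.T)                   # shape (n, m)
    var = 1.0 - np.sum(k_star * v.T, axis=1)           # k(x*, x*) = 1 for this kernel
    return mean, np.maximum(var, 0.0)                  # clip tiny negative round-off


x_train = np.array([-1.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_new = np.array([0.0, 0.75, 3.0])
mean, var = gp_posterior(x_train, y_train, x_new)
print(np.round(mean, 3), np.round(var, 3))
```

At an observed input (x = 0) the variance collapses to roughly the noise level, while far from the data (x = 3) it stays close to the prior variance. This is the uncertainty signal that acquisition functions use.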

Acquisition Functions: To decide which point in the search space to sample next, Bayesian optimization uses acquisition functions, which balance exploration (sampling points with high uncertainty) and exploitation (sampling points with high predicted function values). Common acquisition functions include:

1) Expected Improvement (EI)

Expected Improvement is a popular acquisition function in Bayesian optimization. Given a Gaussian Process (GP) model with a posterior mean function μ(x) and standard deviation σ(x), EI is defined as:

EI(x) = (μ(x) - f(x_best) - ξ)Φ(Z) + σ(x)φ(Z)

where:

f(x_best) is the best function value observed so far, attained at the input x_best,
ξ is a small positive number to encourage exploration,
Φ(Z) and φ(Z) are the cumulative distribution function (CDF) and probability density function (PDF) of the standard normal distribution, respectively, and

Z = (μ(x) - f(x_best) - ξ) / σ(x) if σ(x) > 0, else Z = 0.
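
A minimal sketch of Expected Improvement for a single candidate point, using only the standard library. The function names and the example values are hypothetical. One assumption differs slightly from the convention above: when σ(x) = 0 the sketch returns max(μ(x) - f(x_best) - ξ, 0), which is the limit of the formula as σ(x) approaches zero.

```python
import math


def norm_pdf(z):
    # Standard normal probability density function
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for maximization, given the GP posterior mean and standard deviation."""
    improvement = mu - f_best - xi
    if sigma <= 0.0:
        # Assumption: no uncertainty left, so only a certain gain counts
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * norm_cdf(z) + sigma * norm_pdf(z)


ei_narrow = expected_improvement(mu=1.2, sigma=0.5, f_best=1.0)
ei_wide = expected_improvement(mu=1.2, sigma=1.0, f_best=1.0)
print(ei_narrow, ei_wide)
```

With the same predicted mean, the candidate with the larger standard deviation has the larger EI, which is how the function rewards exploration.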

2) Probability of Improvement (PI)

Probability of Improvement is defined as:

PI(x) = Φ((μ(x) - f(x_best) - ξ) / σ(x))

where ξ is a small positive number to encourage exploration, and Φ(Z) is the CDF of the standard normal distribution.

3) Upper Confidence Bound (UCB)

Upper Confidence Bound is an alternative acquisition function that balances exploration and exploitation. It is defined as:

UCB(x) = μ(x) + κσ(x)

where κ is a positive constant controlling the trade-off between exploration (larger κ) and exploitation (smaller κ).
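
To see how these two acquisition functions can disagree, the sketch below scores two hypothetical candidates: A has a slightly better predicted mean with little uncertainty, and B has a worse mean with large uncertainty. The function names, the candidate values, and the choice κ = 2 are illustrative assumptions.

```python
import math


def norm_cdf(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI for maximization; with sigma = 0 the outcome is certain."""
    if sigma <= 0.0:
        return 1.0 if mu - f_best - xi > 0.0 else 0.0
    return norm_cdf((mu - f_best - xi) / sigma)


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB for maximization: optimistic estimate of the function value."""
    return mu + kappa * sigma


f_best = 1.0
cand_a = {"mu": 1.05, "sigma": 0.05}  # safe bet near the incumbent
cand_b = {"mu": 0.90, "sigma": 0.60}  # uncertain region

pi_a = probability_of_improvement(cand_a["mu"], cand_a["sigma"], f_best)
pi_b = probability_of_improvement(cand_b["mu"], cand_b["sigma"], f_best)
ucb_a = upper_confidence_bound(cand_a["mu"], cand_a["sigma"])
ucb_b = upper_confidence_bound(cand_b["mu"], cand_b["sigma"])
print(pi_a > pi_b, ucb_b > ucb_a)  # True True
```

PI prefers the safe candidate A, because it only measures how likely an improvement is and not how large it could be. UCB with κ = 2 prefers the uncertain candidate B. This difference is one reason PI is often described as the more exploitative choice.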

Comparison plot of acquisition functions in Bayesian Optimization (Expected Improvement, Probability of Improvement and Upper Confidence Bound): https://chart-studio.plotly.com/~aayushmittalaayush/10

In the acquisition functions above, x_best is the best-observed input, ξ is a trade-off parameter, and κ controls the exploration-exploitation balance.

Optimization: Bayesian optimization iteratively updates the GP model with new observations and optimizes the acquisition function to select the next point to sample. This process continues until a stopping criterion is met, such as a maximum number of iterations or convergence of the acquisition function.
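
The iterative loop can be sketched end to end with NumPy. This is a simplified illustration under several assumptions: a hypothetical 1-D test function that is maximized (matching the acquisition-function definitions above, where exploitation favors high predicted values), a zero-mean GP with a fixed RBF length scale, UCB as the acquisition function, a fixed iteration budget as the stopping criterion, and a plain grid search over candidates in place of a proper acquisition optimizer. All names are hypothetical.

```python
import numpy as np


def objective(x):
    # Hypothetical 1-D test function; treated as an expensive black box
    return -np.sin(3 * x) - x ** 2 + 0.7 * x


def rbf_kernel(xa, xb, length_scale):
    return np.exp(-((xa[:, None] - xb[None, :]) ** 2) / (2.0 * length_scale ** 2))


def gp_posterior(x_obs, y_obs, x_cand, length_scale=0.5, noise_var=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = rbf_kernel(x_obs, x_obs, length_scale) + noise_var * np.eye(len(x_obs))
    k_star = rbf_kernel(x_cand, x_obs, length_scale)
    mean = k_star @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))


def bayes_opt(n_iter=20, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    x_cand = np.linspace(-1.0, 2.0, 301)        # candidate grid over the search space
    x_obs = rng.uniform(-1.0, 2.0, size=3)      # a few initial random samples
    y_obs = objective(x_obs)
    for _ in range(n_iter):                     # stopping criterion: fixed budget
        mean, std = gp_posterior(x_obs, y_obs, x_cand)  # update the surrogate
        ucb = mean + kappa * std                        # acquisition function
        x_next = x_cand[np.argmax(ucb)]                 # most promising point
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))     # evaluate the black box
    best = np.argmax(y_obs)
    return x_obs[best], y_obs[best]


x_best, y_best = bayes_opt()
print(x_best, y_best)
```

Each pass through the loop refits the surrogate to all observations so far, scores every candidate with the acquisition function, and spends one evaluation on the highest-scoring point. To minimize instead, negate the objective.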

The above series of plots demonstrates Bayesian optimization using a one-dimensional function. Our goal is to find the function’s minimum value through Bayesian optimization, employing a Gaussian process regression model as a surrogate and the upper confidence bound as the acquisition function.

The optimization process is visualized through a series of plots across multiple iterations, illustrating the algorithm’s progress. Here’s a summary of the insights from these plots:

Iteration 1:
Initially, the algorithm evaluates a few points to build the surrogate model. The first plot displays the actual function, initial observations, Gaussian process mean, and uncertainty, along with the upper confidence bound to indicate high-potential regions for exploration.

Objective function:

f(x) = -sin(3x) - x^2 + 0.7x

Iteration 2:
After updating the Gaussian process model with new observations, the updated model better approximates the true function. The uncertainty region narrows, and the algorithm selects the next point based on the updated upper confidence bound.

Iteration 3:
As more observations are added, the Gaussian process model is refined, focusing more on exploiting acquired knowledge. The updated upper confidence bound guides the next sample point towards regions with a high chance of improvement.

In the upper confidence bound UCB(x) = μ(x) + κσ(x), μ is the mean predicted by the Gaussian process model, and κ is a tunable parameter controlling the exploration-exploitation trade-off.

Iteration 4:

The final iteration sees the algorithm converge to the global minimum. The Gaussian process model closely approximates the true function, and the upper confidence bound effectively navigates the search space.

Throughout the process, Bayesian optimization intelligently balances exploration and exploitation using the Gaussian process model and upper confidence bound. This results in efficient optimization, particularly for expensive or time-consuming black-box functions commonly found in various scientific fields.

In hyperparameter tuning, the quantities being optimized are settings such as the learning rate or the regularization term. Each candidate setting is used to train the model on the training dataset, the resulting model is scored, and the posterior distribution over that score determines which setting to try next. Empirical comparisons are frequently conducted to gauge the effectiveness of the tuned model.


Objective Function For Optimization

Bayesian Optimization for Hyperparameter Tuning

Machine learning models often have numerous hyper-parameters that significantly impact their performance. Manual tuning of these hyper-parameters is time-consuming and requires expert knowledge. Bayesian optimization is a powerful technique for automating the hyper-parameter tuning process, enabling data scientists and machine learning practitioners to build more effective models with less effort.

Hyper-parameter Tuning: Bayesian optimization is widely used for hyper-parameter tuning in machine learning. The Scikit-Optimize library provides the BayesSearchCV class, which simplifies the process. To use BayesSearchCV, import the required packages, instantiate BayesSearchCV with your model and hyper-parameter search space, and call the .fit() method to train and optimize the model.

Bayesian optimization can be applied to tune hyperparameters for a wide range of machine learning models, such as neural networks, support vector machines, decision trees, and ensemble methods.

Using Bayesian optimization for hyperparameter tuning involves the following steps:

Define the search space: Specify the range and types of hyperparameters to be optimized. This could include continuous, discrete, or categorical variables.
Define the objective function: The objective function in this case is the performance of the machine learning model on a validation dataset, given a specific set of hyperparameter values.
Perform Bayesian optimization: Follow the Bayesian optimization process outlined earlier to find the optimal hyperparameter values that maximize the performance of the machine learning model.
Train the final model: Once the optimal hyperparameter values have been identified, train the final model using these values and evaluate its performance on a test dataset.
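
The four steps can be sketched on a toy problem. The example below tunes a single hyperparameter, the regularization strength of a ridge regression, on synthetic data. It deliberately uses a small hand-rolled Gaussian process with an upper confidence bound rather than the Scikit-Optimize API, so every name, the data, and the settings (length scale, κ = 2, ten iterations) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical): 20 features, only 3 informative
X = rng.normal(size=(120, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=1.0, size=120)
X_tr, y_tr = X[:40], y[:40]        # training split
X_va, y_va = X[40:80], y[40:80]    # validation split
X_te, y_te = X[80:], y[80:]        # held-out test split


def fit_ridge(X, y, alpha):
    # Closed-form ridge regression weights
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))


# Step 1: search space, one continuous hyperparameter log10(alpha)
candidates = np.linspace(-4.0, 3.0, 141)


# Step 2: objective = negative validation error, so larger is better
def objective(log_alpha):
    return -mse(X_va, y_va, fit_ridge(X_tr, y_tr, 10.0 ** log_alpha))


def gp_posterior(x_obs, y_obs, x_cand, length_scale=1.0, noise_var=1e-6):
    def kern(a, b):
        return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * length_scale ** 2))
    K = kern(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    k_star = kern(x_cand, x_obs)
    mean = k_star @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))


# Step 3: Bayesian optimization loop with an upper confidence bound
x_obs = np.array([-4.0, 3.0])
y_obs = np.array([objective(v) for v in x_obs])
for _ in range(10):
    # Standardize scores so the unit-variance GP prior is a reasonable fit
    y_scaled = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-12)
    mean, std = gp_posterior(x_obs, y_scaled, candidates)
    x_next = candidates[np.argmax(mean + 2.0 * std)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))
best_log_alpha = x_obs[np.argmax(y_obs)]

# Step 4: retrain on training + validation data, then evaluate on the test set
w_final = fit_ridge(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]),
                    10.0 ** best_log_alpha)
test_mse = mse(X_te, y_te, w_final)
print(best_log_alpha, test_mse)
```

Searching over log10(alpha) rather than alpha itself is a common choice for scale-like hyperparameters, because useful values span several orders of magnitude. The test split is touched only once, after tuning has finished, so the reported error is not biased by the search.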

By automating the hyperparameter tuning process, Bayesian optimization can significantly improve the performance of machine learning models and accelerate the development of state-of-the-art solutions in various domains.

Bayesian Optimization Python – Code Example

In this example, we demonstrate the use of the Scikit-optimize library for Bayesian optimization to optimize a sample function with Gaussian processes. Our objective function to optimize is f(x) = sin(5 * x) * (1 - tanh(x^2)), with the aim of finding the maximum value of this function in the range [-2, 2].

We then perform Bayesian optimization using Scikit-optimize’s gp_minimize function. The function takes several parameters, such as the objective function (wrapped in a helper that returns the negative value, since gp_minimize minimizes and we want the maximum), the bounds of the search space, and other settings related to the optimization process:

After the optimization process is completed, we plot the surrogate function (the Gaussian process approximation of the objective function) along with the 95% confidence interval, the true function, and the sampled points:

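import numpy as np
import matplotlib.pyplot as plt
from skopt import gp_minimize
from skopt.learning import GaussianProcessRegressor
from skopt.learning.gaussian_process.kernels import ConstantKernel, Matern

# Define the objective function; skopt passes x as a list of parameter values
def f(x):
    return np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2))

# gp_minimize minimizes, so negate f to search for its maximum
def neg_f(x):
    return -f(np.array(x))

# Define the bounds of the input space
bounds = [(-2.0, 2.0)]

# Noise level assumed by the GP (passed as alpha); f itself is noise-free
noise = 0.1

# Create a custom Matern 5/2 kernel and estimator
m52 = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5)
gpr = GaussianProcessRegressor(kernel=m52, alpha=noise ** 2)

r = gp_minimize(neg_f,
                bounds,
                base_estimator=gpr,
                acq_func='EI',        # expected improvement
                xi=0.01,              # exploitation-exploration trade-off
                n_calls=25,           # total number of evaluations
                n_initial_points=5,   # initial random samples (n_random_starts in older releases)
                random_state=42)      # random seed for reproducibility

# Plot the fitted model and the sampled points
X = np.linspace(-2, 2, 400).reshape(-1, 1)
# The surrogate is fitted in skopt's transformed space and models neg_f,
# so transform the inputs and flip the sign of the predicted mean
X_model = r.space.transform(X.tolist())
y_pred, sigma = r.models[-1].predict(X_model, return_std=True)

plt.figure(figsize=(10, 5))
plt.plot(X, np.array([f(x) for x in X]), 'r:',
         label=r'$f(x) = \sin(5x)\,(1 - \tanh(x^2))$')
plt.plot(X, -y_pred, 'b-', label='Surrogate function')
plt.fill_between(X.ravel(), -y_pred - 1.96 * sigma, -y_pred + 1.96 * sigma,
                 alpha=0.2, color='b', label='95% confidence interval')
plt.scatter(np.array(r.x_iters).ravel(), -np.array(r.func_vals),
            c='k', label='Sampled points')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.show()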

Bayesian Optimization with Gaussian Process Regression: This plot showcases the surrogate function (blue) approximating the true objective function (red-dotted) along with the 95% confidence interval (shaded blue area). The surrogate function effectively captures the overall trend and assists in optimizing the objective function.

Convergence of Bayesian Optimization: This plot illustrates the progress of the optimization process over successive iterations, highlighting how the minimum function value discovered gradually approaches the true minimum, showcasing the effectiveness of the Bayesian optimization technique.

Applications Of Bayesian Optimization

Bayesian optimization has proven to be an effective tool for a wide range of applications, spanning various industries and research areas. Some notable applications include:

Automated Machine Learning (AutoML): Bayesian optimization has been successfully used in frameworks designed to automatically find optimal machine learning models without human intervention. These AutoML systems leverage training and validation datasets to search for the best algorithm and hyperparameters within the given search space.

Hyperparameter Optimization: As one of the most popular applications of Bayesian optimization, it is extensively used to tune hyperparameters of machine learning models, including deep neural networks, support vector machines, and random forests, to achieve optimal performance. In addition, similarities between unseen datasets and historical datasets can be measured and used to transfer initializations for Bayesian hyperparameter optimization. This warm-starts the optimization process, leading to faster convergence and improved performance.

Bayesian optimization (BO) has made substantial progress in addressing complex problems in engineering and materials science. Aerospace engineers use it to optimize expensive-to-evaluate functions and manage inequality constraints while maximizing improvements within a limited evaluation budget. Meanwhile, materials scientists have applied Bayesian optimization to optimize structure-property relationships in ferromagnetic thin films like FeGaB and FeGaC. By guiding experiments, Bayesian optimization can potentially reduce the number of samples required by up to 50% compared to traditional methods, saving both time and resources.

In the realm of black-box adversarial attacks, researchers are leveraging Bayesian optimization to generate adversarial examples with limited information, specifically focusing on scenarios with low query budgets.

The other applications include:

Hyperparameter Tuning in Machine Learning
Neural Architecture Search
Drug Discovery and Materials Science
Optimization in Robotics and Control Systems

These diverse applications demonstrate the versatility and power of Bayesian optimization in solving complex optimization problems across various domains.

Challenges with Bayesian Optimization

While Bayesian optimization has proven to be a powerful optimization technique, it also comes with some challenges:

High-Dimensional Objective Functions

Bayesian optimization has been found to struggle with high-dimensional objective functions, particularly those with more than 20 dimensions. This limitation can hinder its performance in complex problems.

Computational Cost

Gaussian process regression is employed to model the surrogate function in Bayesian optimization. However, it can be computationally expensive for large datasets or high-dimensional search spaces, leading to increased processing time.

Noisy Samples

Real-world applications often involve noisy samples in the data. This noise can adversely affect the performance of Bayesian optimization, making it less reliable in such scenarios.

Choice of Acquisition Function

The acquisition function plays a crucial role in the optimization process, and selecting the appropriate one is vital. However, there is no universal solution, and the choice often depends on the specific problem being addressed.

Conclusion

In summary, Bayesian optimization is a powerful, versatile technique for addressing complex design problems across various fields, including engineering, pharmaceuticals, software development, and machine learning. It combines Bayes’ theorem, probabilistic surrogate models, and acquisition functions to optimize black-box functions that are expensive to evaluate.

Challenges remain in handling high-dimensional functions, computational cost, noisy samples, and the choice of acquisition function. As researchers continue to develop methods to overcome these challenges, Bayesian optimization holds great promise for innovation across many domains.

References

“Gaussian Process Regression.” Plotly, https://chart-studio.plotly.com/~aayushmittalaayush/6. Accessed 15 Apr. 2023.

Design of the 2015 ChaLearn AutoML challenge. IEEE Xplore. https://ieeexplore.ieee.org/document/7280767

Application of Bayesian optimization and regression analysis to ferromagnetic materials development. IEEE Xplore. https://ieeexplore.ieee.org/document/9599674

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & de Freitas, N. (2016b). Taking the Human Out of the Loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148–175. https://doi.org/10.1109/jproc.2015.2494218

Shukla, S. N., Sahu, A. K., Willmott, D., & Kolter, J. Z. (2019, September 30). Black-box adversarial attacks with bayesian optimization. arXiv.Org. https://arxiv.org/abs/1909.13857

Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems.
