Understanding the New AdaBelief Optimizer

We discuss the AdaBelief optimizer, explore how it works, and create a bare-bones implementation in PyTorch.

Frank Odom · Published in The DL · 5 min read · Oct 25, 2020


Introduction

All machine learning algorithms use optimization in one form or another. So, when new advancements are made in optimization research, we should pay attention. AdaBelief represents a very significant advancement in gradient-based optimization for deep learning. Its authors show that it outperforms other deep learning-focused optimizers across a wide variety of applications. The implications here are tremendous:

  • Faster training convergence
  • Greater training stability
  • Better model generalization
  • Increased model accuracy

We expect that, without any additional changes, existing state-of-the-art deep learning models will improve just by using AdaBelief during training.

Stochastic Gradient Descent

In order to understand AdaBelief, we first need to understand the basics of stochastic, gradient-based optimizers. Nearly all deep learning optimizers fall into this category, and AdaBelief is no different. So, let’s start at the beginning: Stochastic Gradient Descent.

Stochastic Gradient Descent (SGD) is the original gradient-based optimizer. It is easy to implement, strongly theoretically grounded, extremely stable during training, and gives results competitive with many more advanced optimizers. The idea is simple: compute the gradient of the loss with respect to each parameter, and take a small step in the opposite direction of the gradient (downhill). If we do this many times, using randomly selected (stochastic) batches of samples from the training data, our model will gradually improve until it reaches a minimum.
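In symbols, if θ denotes a parameter, α the learning rate, and g_t the gradient computed on batch t, each update is simply:

$$ \theta_t = \theta_{t-1} - \alpha \, g_t $$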

Historically, the most difficult part of SGD was computing gradients for all model parameters. But modern deep learning libraries (e.g. PyTorch, TensorFlow) can compute gradients for us automatically, so our job is much easier! Let’s jump right in, and write a PyTorch implementation of SGD.

Most of this is just boilerplate code, which we need in order to conform to the Optimizer parent class. The meaningful part of SGD is contained in just a few lines, as shown below.
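Here is a minimal sketch, assuming a recent version of PyTorch (the class structure, names, and defaults are illustrative, not the article’s original listing):

```python
import torch
from torch.optim import Optimizer


class SGD(Optimizer):
    """Bare-bones stochastic gradient descent (no momentum, no weight decay)."""

    def __init__(self, params, lr: float = 1e-3):
        if lr <= 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        super().__init__(params, defaults=dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # The meaningful part of SGD: step opposite the gradient.
                p.add_(p.grad, alpha=-group["lr"])

        return loss
```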

Adam

There’s one major problem with SGD: it converges too slowly, especially in the early parts of training, so we have to perform a large number of updates before the model approaches a minimum. This costs precious time and computational resources. Many variants of SGD exist that attempt to address this problem. Among them, the Adaptive Moment Estimation (Adam) optimizer is likely the most popular and well known.

Adam introduces two internal states for each parameter: a momentum estimate (m) and a second-moment estimate (v), which is a running average of the squared gradient (g). With each training batch, both are updated using exponentially weighted averaging (EWA):
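$$ m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t $$
$$ v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2 $$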

where the β values are provided as hyperparameters. These are then used to update the parameters for each step:
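$$ \theta_t = \theta_{t-1} - \alpha \, \frac{m_t}{\sqrt{v_t} + \epsilon} $$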

where α is the learning rate, and ϵ is added to improve stability. It is straightforward to implement this in code, given the work we’ve already done with SGD.
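A minimal sketch, following the same structure as the SGD class above (again, class name and defaults are illustrative rather than the article’s original listing):

```python
import torch
from torch.optim import Optimizer


class Adam(Optimizer):
    """Bare-bones Adam (bias correction omitted for simplicity)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, defaults=dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                m, v = state["m"], state["v"]

                # EWA of the gradient and of the squared gradient.
                m.mul_(beta1).add_(grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                # theta <- theta - lr * m / (sqrt(v) + eps)
                p.addcdiv_(m, v.sqrt().add_(group["eps"]), value=-group["lr"])

        return loss
```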

Note: For simplicity, we have omitted the bias correction terms, similar to the authors of AdaBelief. The behavior of Adam should still be nearly identical after the first 100 steps or so.

Intuitively, why would Adam converge any faster than SGD? Exponentially weighted averaging provides a smoother approximation of the gradient over time, leading to increased stability. Then, by dividing by the square root of the second-moment estimate (v), our step size sharply decreases when the gradient is large or varies a lot between batches. This essentially means that, when it is not clear which direction to move, Adam takes much smaller steps.
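As a rough illustration (my numbers, not the article’s): with β₁ = 0.9, a gradient that alternates between +1 and −1 gives m_t ≈ 0 but v_t ≈ 1, so the step m_t / (√v_t + ϵ) stays close to zero; a gradient that is consistently +1 gives m_t ≈ 1 and v_t ≈ 1, so each step has a size close to the full learning rate α.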

AdaBelief

Of course, Adam is not the end of the story. The authors of AdaBelief point out a significant problem with Adam, illustrated by case 3 in the figure below. When the gradient is large but its variance is small, Adam still takes a small step. This doesn’t make intuitive sense: if the magnitude and direction of the gradient are consistent, we expect to take larger steps, because we are more confident that the step direction is correct.

Source: AdaBelief Optimizer, by Zhuang et al. (Figure 1)

AdaBelief fixes this with a very minimal change from Adam. Rather than tracking a running average of the squared gradient, AdaBelief tracks the variance of the gradient over time. This difference is subtle but important:
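$$ \text{Adam:}\quad v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2 $$
$$ \text{AdaBelief:}\quad s_t = \beta_2 \, s_{t-1} + (1 - \beta_2) \, (g_t - m_t)^2 $$

Ignoring bias correction, the parameter update is then the same as Adam’s, with √s_t taking the place of √v_t in the denominator:

$$ \theta_t = \theta_{t-1} - \alpha \, \frac{m_t}{\sqrt{s_t} + \epsilon} $$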

This is where the “belief” in AdaBelief comes from, because variance is computed using our current, estimated momentum. The variance is essentially the squared distance from the expected (or believed) gradient. And when variance is small, our step size remains large!

The implementation of AdaBelief is nearly identical to Adam. In fact, you could build AdaBelief by changing a single line of code from Adam. We’ll make several changes, but most of them are just cosmetic.
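A sketch along the same lines as the Adam class above (again illustrative, with bias correction still omitted); the comment marks the one meaningful change:

```python
import torch
from torch.optim import Optimizer


class AdaBelief(Optimizer):
    """Bare-bones AdaBelief (bias correction omitted, as with Adam above)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, defaults=dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["m"] = torch.zeros_like(p)
                    state["s"] = torch.zeros_like(p)
                m, s = state["m"], state["s"]

                m.mul_(beta1).add_(grad, alpha=1 - beta1)

                # The one meaningful change from Adam: track the variance of
                # the gradient around the momentum, not the raw squared gradient.
                diff = grad - m
                s.mul_(beta2).addcmul_(diff, diff, value=1 - beta2)

                p.addcdiv_(m, s.sqrt().add_(group["eps"]), value=-group["lr"])

        return loss
```

Because it subclasses torch.optim.Optimizer, this drops into a training loop like any built-in optimizer: construct it with model.parameters() and a learning rate, then call loss.backward() and optimizer.step() as usual.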

That’s it! In my opinion, it is shocking that such a small change creates a dramatic difference in performance. The AdaBelief authors show that it outperforms Adam, SGD, and many other optimizers across a wide variety of applications.

Source: AdaBelief Optimizer, by Zhuang et al. (experimental results)

Conclusions

I’m very excited by the publication of AdaBelief. It’s fantastically simple, and remarkably effective. Assuming these results hold for other datasets and model types, I expect to see AdaBelief cited in many research publications in the very near future. Hopefully, it will save everyone some training time, while also boosting the performance of existing state-of-the-art models!

References

Zhuang et al., "AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients": https://arxiv.org/pdf/2010.07468v1.pdf
Kingma & Ba, "Adam: A Method for Stochastic Optimization": https://arxiv.org/pdf/1412.6980.pdf
Official AdaBelief implementation: https://github.com/juntang-zhuang/Adabelief-Optimizer
pytorch-optimizer: https://github.com/jettify/pytorch-optimizer
