Frank Odom
1 min read · Jan 7, 2021

There are no guarantees that any optimizer will *always* outperform another, but AdaBelief outperformed RAdam on every benchmark in the paper! I understand your concerns about learning rate scheduling -- it's one more thing to worry about when training the network. But the same learning rate schedule is used for every experiment in the paper, and it's not particularly difficult to set up. (The `torch.optim.lr_scheduler` module contains everything you need and more.)
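
For reference, here is a minimal sketch of that kind of schedule in PyTorch. A plain `Adam` optimizer and a toy model stand in for the real setup, and the milestones and decay factor are placeholders rather than the values from the paper -- the same pattern works with an AdaBelief or RAdam instance.

```python
import torch
from torch import nn, optim

# Toy model and optimizer; Adam stands in here, but the same pattern
# applies to an AdaBelief or RAdam optimizer instance.
model = nn.Linear(10, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Decay the learning rate by 10x at epochs 75 and 150 (placeholder milestones).
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[75, 150], gamma=0.1)

for epoch in range(200):
    # ... run one epoch of training here ...
    scheduler.step()  # advance the schedule once per epoch
```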

Changing the schedule for RAdam should have a smaller effect. By design, RAdam is less sensitive to the learning rate, so I'm skeptical that it will bridge the performance gap with AdaBelief.

It's possible that you're right about training with RAdam initially and shifting to AdaBelief after the lr decrease. But I think there's a decent chance it doesn't work as expected -- RAdam may position you differently in parameter space, such that AdaBelief doesn't perform as well at the lower learning rate. Maybe you could set up a small-scale experiment to test that? Would definitely be interested if it works.
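
If it helps, a rough sketch of that experiment might look like the following. This assumes the `adabelief-pytorch` package for `AdaBelief` and a recent PyTorch release that ships `torch.optim.RAdam`; the model, switch epoch, and learning rates are placeholders, not values from the paper.

```python
import torch
from torch import nn, optim
from adabelief_pytorch import AdaBelief  # assumes the adabelief-pytorch package

model = nn.Linear(10, 2)    # toy stand-in for the real network
switch_epoch = 150          # placeholder: epoch of the big lr decrease
low_lr = 1e-4               # placeholder: post-decay learning rate

# Recent PyTorch releases include RAdam directly in torch.optim.
optimizer = optim.RAdam(model.parameters(), lr=1e-3)

for epoch in range(200):
    # ... run one epoch of training with `optimizer` here ...

    if epoch + 1 == switch_epoch:
        # Swap to AdaBelief at the lr decrease. Its moment estimates start
        # fresh, which is part of what this experiment would probe.
        optimizer = AdaBelief(model.parameters(), lr=low_lr)
```

One thing to watch: the fresh AdaBelief optimizer discards RAdam's accumulated moment estimates, so any difference you see mixes the optimizer change with that reset.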
