Abstract: We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound which is explicit in the constants of the problem, the parameters of the optimizer, and the total number of iterations N. This bound can be made arbitrarily small: Adam with a learning rate α=1/√N and a momentum parameter on squared gradients β2=1−1/N achieves the same rate of convergence O(ln(N)/√N) as Adagrad. Finally, we obtain the tightest dependency on the heavy-ball momentum parameter β1 among all previous convergence bounds for non-convex Adam and Adagrad, improving from O((1−β1)^-3) to O((1−β1)^-1). Our technique also improves the best known dependency for standard SGD by a factor of 1−β1.
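A minimal sketch, assuming the standard Adam update of Kingma & Ba, of the hyperparameter schedule highlighted in the abstract (learning rate α=1/√N and squared-gradient momentum β2=1−1/N for a fixed iteration budget N); the toy objective and all variable names here are illustrative, not from the paper.

```python
import math

def adam_step(x, g, m, v, t, alpha, beta1, beta2, eps=1e-8):
    """One scalar Adam step; returns updated (x, m, v)."""
    m = beta1 * m + (1 - beta1) * g        # heavy-ball momentum on gradients
    v = beta2 * v + (1 - beta2) * g * g    # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

# Schedule from the abstract: alpha = 1/sqrt(N), beta2 = 1 - 1/N.
N = 10_000
alpha, beta1, beta2 = 1 / math.sqrt(N), 0.9, 1 - 1 / N

# Toy smooth non-convex objective f(x) = x^2 + 3*sin(x)^2 (illustrative).
grad = lambda x: 2 * x + 3 * math.sin(2 * x)

x, m, v = 3.0, 0.0, 0.0
for t in range(1, N + 1):
    x, m, v = adam_step(x, grad(x), m, v, t, alpha, beta1, beta2)
# After N steps, x should sit near the stationary point at 0.
```

With this schedule the step size shrinks and β2 grows with the budget N, which is what lets the bound match Adagrad's O(ln(N)/√N) rate.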
@article{defossez-2022,
  title   = {A Simple Convergence Proof of Adam and Adagrad},
  author  = {D{\'e}fossez, Alexandre and Bottou, L{\'e}on and Bach, Francis and Usunier, Nicolas},
  journal = {Transactions on Machine Learning Research},
  issn    = {2835-8856},
  year    = {2022},
  url     = {http://leon.bottou.org/papers/defossez-2022},
}