Anytime Acceleration of Gradient Descent
This is my note on the paper Anytime Acceleration of Gradient Descent.
Introduction
In my previous note Classical Gradient Descent Analysis, we analyzed gradient descent with a fixed stepsize $\eta$ on
\[\min_{x\in\RR^d} f(x)\]where $f$ is smooth and convex, and obtained the classical $O(1/t)$ rate. The question here is: if the stepsize $\eta_t$ is allowed to vary with the iteration $t$, can gradient descent achieve a rate faster than $1/t$?
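To fix ideas, here is a minimal sketch (mine, not from the paper) of gradient descent with a per-iteration stepsize schedule $\eta_t$; the objective, dimension, and schedule values are purely illustrative.

```python
import numpy as np

def gradient_descent(grad, x0, stepsizes):
    """Run x_{t+1} = x_t - eta_t * grad(x_t) for a given schedule (eta_0, ..., eta_{n-1})."""
    x = np.asarray(x0, dtype=float)
    for eta in stepsizes:
        x = x - eta * grad(x)
    return x

# Illustrative smooth convex objective: f(x) = 0.5 * x^T A x with A diagonal, so L = max(diag(A)) = 1.
A = np.diag([1.0, 0.1, 0.01])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = 1.0

x0 = np.ones(3)
x_fixed = gradient_descent(grad, x0, [1.0 / L] * 8)             # fixed stepsize eta = 1/L
x_varying = gradient_descent(grad, x0, [1.5 / L, 0.5 / L] * 4)  # an arbitrary varying schedule
print(f(x_fixed), f(x_varying))
```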
Basic Inequality for Smooth Convex Functions
For an $L$-smooth convex function, we first establish the following inequality:
\[f(y)\geq f(x)+\left\langle \nabla f(x), y-x\right\rangle +\frac{1}{2L}\|\nabla f(y)-\nabla f(x)\|^2\]This can be proved by noting that
\[g_x(y):=f(y)-f(x)-\left\langle \nabla f(x), y-x\right\rangle\]is also a convex $L$-smooth function with $\nabla g_x(y)=\nabla f(y)-\nabla f(x)$, and $x$ is its global minimum with $g_x(x)=0$. Then
\[0=g_x(x)\leq g_x\left(y-\frac{\nabla g_x(y)}{L}\right)\leq g_x(y)-\frac{1}{2L}\|\nabla g_x(y)\|^2,\]where the second inequality is the descent lemma for $L$-smooth functions; rearranging gives the claimed inequality. From an optimization point of view, co-coercivities
\[Q_{ij}:=2(f_i-f_j)+2\left\langle g_j, x_j-x_i\right\rangle -\|g_j-g_i\|^2\geq 0\]generate all possible long-range consistency constraints on the objective function $f$. Here $x_i$ is an iterate (with $x_*:=x^*$), $f_i:=f(x_i)$, $g_i:=\nabla f(x_i)$, and $L$ is normalized to $1$, so that $Q_{ij}\geq 0$ is exactly the basic inequality above applied to the pair $(x_i,x_j)$. In other words, the co-coercivity conditions generate all possible valid inequalities with which one can prove convergence rates for GD.
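As a quick numerical sanity check (my own, not from the paper), the sketch below verifies that $Q_{ij}\geq 0$ on random points for a random convex quadratic normalized so that $L=1$; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random convex quadratic f(x) = 0.5 * x^T A x, normalized so its smoothness constant is L = 1.
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T
A /= np.max(np.linalg.eigvalsh(A))

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def Q(xi, xj):
    """Co-coercivity Q_{ij} = 2(f_i - f_j) + 2<g_j, x_j - x_i> - ||g_j - g_i||^2 (with L = 1)."""
    gi, gj = grad(xi), grad(xj)
    return 2 * (f(xi) - f(xj)) + 2 * gj @ (xj - xi) - np.sum((gj - gi) ** 2)

# Every pair of points should give a non-negative value (up to rounding error).
points = [rng.standard_normal(d) for _ in range(6)]
print(min(Q(xi, xj) for xi in points for xj in points))
```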
Intuition: Two-Step Case
Acceleration by Stepsize Hedging I: Multi-Step Descent and the Silver Stepsize Schedule
Silver Stepsize Schedule
In Acceleration by Stepsize Hedging II: Silver Stepsize Schedule for Smooth Convex Optimization, the authors show that gradient descent satisfies
\[f_n-f^*\leq r_k\|x_0-x^*\|^2,\]where $n=2^k-1$, $r_k=\frac{1}{1+\sqrt{4\rho^{2k}-3}}$, $\rho=1+\sqrt{2}$ is the silver ratio, and the learning rates follow the (recursively defined) silver stepsize schedule. Since $r_k=\Theta(\rho^{-k})$, this is an accelerated rate of order $n^{-\log_2\rho}\approx n^{-1.2716}$.
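As a quick numerical check (my own, not from the paper), the following sketch computes $r_k$ for a few values of $k$ and compares it against the claimed $n^{-\log_2\rho}$ scaling:

```python
import numpy as np

rho = 1 + np.sqrt(2)  # silver ratio

for k in range(1, 8):
    n = 2**k - 1
    r_k = 1 / (1 + np.sqrt(4 * rho**(2 * k) - 3))
    # r_k should track n^(-log2(rho)) up to a constant factor.
    print(f"k={k}  n={n:4d}  r_k={r_k:.3e}  n^(-log2 rho)={n ** (-np.log2(rho)):.3e}")
```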
The authors exhibit explicit non-negative multipliers $\lambda_{ij}$ satisfying the following identity
\[\sum_{i\neq j\in\{0,\cdots,n,*\}}\lambda_{ij}Q_{ij}=\|x_0-x^*\|^2-\left\|x_n-\frac{g_n}{2r_k} -x^*\right\|^2+\frac{f^*-f_n}{r_k}.\]Since every $Q_{ij}\geq 0$ and every $\lambda_{ij}\geq 0$, the left-hand side is non-negative; dropping the non-positive squared-norm term on the right then immediately implies
\[f_n-f^*\leq r_k \|x_0-x^*\|^2.\]When $n=0$, the identity is
\[\lambda_{*0}Q_{*0}+\lambda_{0*}Q_{0*}=\|x_0-x^*\|^2-\|x_0-g_0 -x^*\|^2+2(f^*-f_0),\]which holds with $\lambda_{*0}=1$ and $\lambda_{0*}=0$ (this is the identity above with $r_0=\frac{1}{2}$).
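To check this base case directly (my own expansion), use $g_*=\nabla f(x^*)=0$ and expand the difference of squared norms:
\[\|x_0-x^*\|^2-\|x_0-g_0-x^*\|^2+2(f^*-f_0)=2\left\langle g_0, x_0-x^*\right\rangle-\|g_0\|^2+2(f^*-f_0)=Q_{*0},\]which matches the left-hand side with $\lambda_{*0}=1$ and $\lambda_{0*}=0$.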
When $n=1$,
For larger horizons $n$, we construct $\lambda_{ij}$ via recursive gluing.