In this article we investigate kernel ridge regression (KRR).

A function $k:\cX\times\cX\to\RR$ is said to be a kernel if it is symmetric and positive semidefinite.
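To make the definition concrete, the following minimal sketch checks both properties numerically on a finite sample: the kernel matrix should be symmetric with no negative eigenvalues (up to rounding). The Gaussian kernel, the `bandwidth` parameter, and the helper names are illustrative assumptions, not part of the article.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on R^d; bandwidth is an illustrative parameter."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth**2)))

def gram_matrix(kernel, points):
    """Kernel matrix K[i, j] = kernel(points[i], points[j])."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))      # 20 arbitrary points in R^3
K = gram_matrix(gaussian_kernel, points)

# Symmetry: K equals its transpose.
is_symmetric = np.allclose(K, K.T)
# Positive semidefiniteness: the smallest eigenvalue is >= 0 up to rounding.
min_eigenvalue = np.linalg.eigvalsh(K).min()
```

A check on one finite sample can only refute positive semidefiniteness, never prove it; the definition requires the property for every finite set of points.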

Consider a kernel $k:\cX\times\cX\to\RR$, and let $\cH$ denote the associated reproducing kernel Hilbert space (RKHS).

Mercer’s Decomposition

Let $\rho$ denote the distribution of the inputs on $\cX$, and consider the integral operator $\cT:L^2(\rho)\to L^2(\rho)$ defined by

\[(\cT f)(x)=\int k(x,x')f(x')\mathrm{d}\rho (x').\]

Assume that

Under this condition, we have the decomposition

\[k(x,x')=\sum_{j=1}^\infty \mu_j e_j(x)e_j(x'),\]

where $\{\mu_j\}_{j=1}^\infty$ are the eigenvalues of $\cT$ in decreasing order, and $\{e_j\}_{j=1}^\infty$ are the corresponding orthonormal eigenfunctions in $L^2(\rho)$, i.e., $\cT e_j=\mu_je_j$.
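As a numerical illustration, the eigenvalues of $\cT$ can be approximated by those of the scaled kernel matrix $K/n$ built on points that represent $\rho$. The sketch below assumes, purely for illustration, $\rho=\mathrm{Uniform}[0,1]$ and the kernel $k(x,x')=\min(x,x')$, whose spectrum is known in closed form: $\mu_j=1/((j-\tfrac12)^2\pi^2)$. Neither choice comes from the article.

```python
import numpy as np

n = 400
x = np.arange(1, n + 1) / n        # uniform grid on (0, 1], standing in for rho
K = np.minimum.outer(x, x)         # k(x, x') = min(x, x')

# Eigenvalues of K / n approximate those of T; sort them in decreasing order.
empirical = np.sort(np.linalg.eigvalsh(K / n))[::-1]

j = np.arange(1, 6)
exact = 1.0 / ((j - 0.5) ** 2 * np.pi ** 2)   # known spectrum of the min kernel
relative_error = np.abs(empirical[:5] - exact) / exact
```

The leading empirical eigenvalues should agree with the exact ones to within a fraction of a percent at this grid size, and the agreement improves as $n$ grows.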

Interpolation Spaces

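Let

\[\cH^s:=\left\{f=\sum_{j=1}^\infty a_j\mu_j^{s/2}e_j \,:\, \sum_{j=1}^\infty a_j^2<\infty\right\},\qquad \|f\|_{\cH^s}^2:=\sum_{j=1}^\infty a_j^2.\]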

For all $s\geq 0$, $\cH^s$ is a well-defined Hilbert space. It is an interpolation space with $s$ quantifying the smoothness relative to $\cH$.

The larger $s$ is, the smaller $\cH^s$ becomes. In particular, $\cH^1=\cH$ and $\cH^0=L^2(\rho)$. Therefore, $s>1$ means extra smoothness, while $s<1$ corresponds to the misspecified case, in which functions need not belong to $\cH$.
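The nesting can be illustrated numerically. Assuming the common spectral characterization, in which $f=\sum_j c_je_j$ belongs to $\cH^s$ exactly when $\sum_j c_j^2/\mu_j^s<\infty$, the sketch below evaluates truncated versions of this sum. The decay rates $\mu_j=j^{-2}$ and $c_j=j^{-1.2}$ are hypothetical choices made only for this example.

```python
import numpy as np

def interpolation_norm_sq(coeffs, eigenvalues, s):
    """Truncated squared H^s norm sum_j c_j^2 / mu_j^s of f = sum_j c_j e_j."""
    coeffs = np.asarray(coeffs, dtype=float)
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return float(np.sum(coeffs**2 / eigenvalues**s))

def truncated_norms(num_terms, s_values=(0.0, 0.5, 1.0)):
    """H^s norms of the same f when only the first num_terms modes are kept."""
    j = np.arange(1, num_terms + 1)
    mu = j ** -2.0     # hypothetical eigenvalue decay
    c = j ** -1.2      # hypothetical L^2(rho) coefficients of f
    return {s: interpolation_norm_sq(c, mu, s) for s in s_values}

short = truncated_norms(2_000)
long = truncated_norms(200_000)
```

With these choices the summand is $j^{2s-2.4}$, so the series converges only for $s<0.7$: the values for $s=0$ and $s=0.5$ stabilize as more terms are added, while the value for $s=1$ keeps growing. This $f$ lies in $L^2(\rho)$ but not in $\cH$.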

Basic Properties of KRR

In supervised learning for regression tasks, we work with $n$ i.i.d. training samples $\{(x_i,y_i)\}_{i=1}^n$ with $x_i\in\cX$ and $y_i=f^*(x_i)+\epsilon_i$ for $i\in [n]$. Here $f^*:\cX\to\RR$ represents the target function and $\epsilon_i$ represents the label noise.

The primary objective in regression is to recover $f^*$ as accurately as possible using these samples, aiming to minimize the generalization error:

\[\|\hat{f}-f^*\|_\rho^2 = \int |\hat{f}(x)-f^*(x)|^2\,\mathrm{d}\rho(x),\]

where $\hat{f}$ denotes our model.
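In practice this integral is rarely available in closed form, and one option is to estimate it by Monte Carlo with fresh samples from $\rho$. The sketch below assumes $\rho=\mathrm{Uniform}[0,1]$ and uses a hypothetical target and model chosen so that the exact error, $0.01/3$, is known; none of these choices come from the article.

```python
import numpy as np

def l2_rho_error(f_hat, f_star, samples):
    """Monte Carlo estimate of ||f_hat - f_star||_rho^2 from samples x ~ rho."""
    diff = f_hat(samples) - f_star(samples)
    return float(np.mean(diff**2))

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=200_000)   # assumed rho = Uniform[0, 1]

def f_star(x):
    """Hypothetical target function."""
    return np.sin(2 * np.pi * x)

def f_hat(x):
    """Hypothetical model: the target plus a small linear perturbation."""
    return np.sin(2 * np.pi * x) + 0.1 * x

# Exact value: integral of (0.1 x)^2 over [0, 1] = 0.01 / 3.
error = l2_rho_error(f_hat, f_star, samples)
```

The estimate fluctuates at the order of $1/\sqrt{N}$ in the number of samples $N$, so it should be read as an approximation of the integral rather than its exact value.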

Given a model class $\cH$, the best estimator $\hat{f}$ in this sense is the solution of the following equation

\[\]

The KRR estimators are defined by

\[\hat f_\lambda=\arg\min_{f\in\cH}\left[\frac{1}{n}\sum_{i=1}^n(f(x_i)-y_i)^2+\lambda \|f\|_\cH^2\right].\]

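A closed form is provided by the representer theorem:

\[\hat f_\lambda(x)=k(x,X)(K+n\lambda I_n)^{-1}y,\]

where $K=(k(x_i,x_j))_{i,j=1}^n\in\RR^{n\times n}$ is the kernel matrix, $k(x,X)=(k(x,x_1),\dots,k(x,x_n))\in\RR^{1\times n}$, and $y=(y_1,\dots,y_n)^\top\in\RR^n$.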
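A minimal NumPy sketch of this estimator follows. It relies on the standard consequence of the representer theorem for the objective above: the minimizer has the form $\sum_{i=1}^n\alpha_ik(x,x_i)$ with $\alpha=(K+n\lambda I_n)^{-1}y$, where $K$ is the $n\times n$ kernel matrix. The Gaussian kernel, its `bandwidth`, the synthetic target, and the function names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=0.2):
    """Gaussian kernel matrix between the rows of A and B (illustrative choice)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def krr_fit(X, y, lam, kernel=rbf_kernel):
    """Solve (K + n * lam * I) alpha = y for the representer coefficients."""
    n = X.shape[0]
    K = kernel(X, X)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, alpha, X_new, kernel=rbf_kernel):
    """Evaluate f_hat(x) = sum_i alpha_i k(x, x_i) at the rows of X_new."""
    return kernel(X_new, X_train) @ alpha

def f_star(Z):
    """Hypothetical target function on [0, 1]."""
    return np.sin(2 * np.pi * Z[:, 0])

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = f_star(X) + 0.1 * rng.normal(size=100)   # labels with Gaussian noise

alpha = krr_fit(X, y, lam=1e-3)
X_test = np.linspace(0.0, 1.0, 200)[:, None]
test_mse = np.mean((krr_predict(X, alpha, X_test) - f_star(X_test)) ** 2)
```

The factor $n$ in front of $\lambda$ comes from the $1/n$ normalization of the empirical loss. Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly.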

If we introduce the auxiliary integral operators

As a side remark, note that when $\lambda\to 0$, $\hat{f}_\lambda$ converges to the minimum norm solution:

\[\hat f_0=\arg\min_{f\in\cH,f(x_i)=y_i, \forall i\in [n]}\|f\|_\cH.\]
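This limit can be observed numerically. When the kernel matrix $K$ is invertible, the minimum-norm interpolant has coefficients $\alpha_0=K^{-1}y$, and the ridge coefficients $(K+n\lambda I_n)^{-1}y$ should approach $\alpha_0$ as $\lambda$ decreases. The sketch below assumes a Laplace kernel and synthetic data, both chosen only for illustration.

```python
import numpy as np

def kernel_matrix(a, b):
    """Laplace kernel exp(-|x - x'|) on the real line (illustrative choice)."""
    return np.exp(-np.abs(a[:, None] - b[None, :]))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 15)                     # distinct training inputs
y = np.cos(3.0 * x) + 0.05 * rng.normal(size=15)  # hypothetical noisy labels
n = len(x)
K = kernel_matrix(x, x)

# Minimum-norm interpolant: alpha_0 = K^{-1} y.
alpha_0 = np.linalg.solve(K, y)

# Distance between ridge coefficients and alpha_0 for decreasing lambda.
gaps = []
for lam in (1e-2, 1e-4, 1e-6, 1e-8):
    alpha_lam = np.linalg.solve(K + n * lam * np.eye(n), y)
    gaps.append(float(np.linalg.norm(alpha_lam - alpha_0)))

# The limiting solution fits the training labels exactly (up to rounding).
interpolation_residual = float(np.max(np.abs(K @ alpha_0 - y)))
```

The entries of `gaps` shrink as $\lambda$ decreases, while `interpolation_residual` stays at rounding level, which is consistent with $\hat f_0$ interpolating the data.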

Define
