This is a note on sparse linear models in high dimension.

Recovery in the Noiseless Setting

We begin by focusing on the case in which the observations are perfect or noiseless. Concretely, we wish to find a solution $\theta$ to the linear system \(y = \bX\theta\), where $y \in \mathbb{R}^n$ and $\bX \in \mathbb{R}^{n \times d}$ are given. This is an underdetermined system of linear equations when $d > n$. Our goal is to find the sparsest solution.

The first idea is to consider the hard-sparse optimization problem

\[\begin{equation}\tag{HS}\label{HS} \min_{\theta \in \mathbb{R}^d} \|\theta\|_0 \quad \text{such that} \quad \bX\theta = y. \end{equation}\]

However, this is a combinatorial optimization problem: in general, we need to search over all $\binom{d}{k}$ subsets of cardinality $k = 1,\dots,d$, which is computationally infeasible when the ground-truth sparsity is large.

A natural idea is to relax the $\ell_0$ objective to its convex surrogate, the $\ell_1$ norm, which yields the basis pursuit linear program

\[\begin{equation}\tag{BS}\label{BS} \min_{\theta \in \mathbb{R}^d} \|\theta\|_1 \quad \text{such that} \quad \bX\theta = y. \end{equation}\]
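To see that $\eqref{BS}$ is indeed a linear program, note that the $\ell_1$ objective admits the standard epigraph reformulation (the auxiliary variables $t_i$ are introduced here only for illustration):

\[\min_{\theta, t \in \mathbb{R}^d} \sum_{i=1}^d t_i \quad \text{such that} \quad \bX\theta = y, \quad -t_i \leq \theta_i \leq t_i, \quad i = 1,\dots,d.\]

At any optimum $t_i = |\theta_i|$, so the two programs share the same optimal $\theta$.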

Define

\[\CC(S) = \{ \Delta \in \mathbb{R}^d \mid \| \Delta_{S^c} \| _1 \leq \|\Delta_S\|_1 \}.\]

We say the matrix $\bX$ satisfies the restricted nullspace property with respect to $S$ if

\[\CC(S) \cap \mathrm{null}(\bX) = \{0\}.\]

Then the following two properties are equivalent:

  • For any vector $\theta^* \in \mathbb{R}^d$ with support $S$, the basis pursuit program $\eqref{BS}$ applied with \(y = \bX\theta^*\) has unique solution $\widehat{\theta} = \theta^*$.
  • The matrix $\bX$ satisfies the restricted nullspace property with respect to $S$.

Example: suppose $\bX\in\RR^{n\times d}$ has i.i.d. entries $\bX_{ij}\sim\cN(0,1)$. If $n\geq ck\log(\frac{ed}{k})$ for a sufficiently large constant $c$, then with high probability the restricted nullspace property holds uniformly over all subsets $S$ of cardinality at most $k$ (uniform $RN(k)$).
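This recovery phenomenon is easy to check numerically. The sketch below (dimensions, seed, and the use of `scipy.optimize.linprog` are illustrative choices, not from the text) solves $\eqref{BS}$ by splitting $\theta$ into positive and negative parts:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d, k = 60, 200, 5  # toy dimensions with n well above k*log(e*d/k)

# k-sparse ground truth and i.i.d. standard Gaussian design
theta_star = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
theta_star[support] = rng.normal(size=k)
X = rng.normal(size=(n, d))
y = X @ theta_star  # noiseless observations

# Basis pursuit as an LP: write theta = p - m with p, m >= 0, then
#   min 1^T(p + m)  subject to  X(p - m) = y
res = linprog(c=np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=y,
              bounds=(0, None))
theta_hat = res.x[:d] - res.x[d:]
print(np.max(np.abs(theta_hat - theta_star)))  # tiny if recovery is exact
```

With these dimensions the sample size comfortably exceeds $ck\log(ed/k)$, so exact recovery should occur up to solver tolerance.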

Estimation in Noisy Settings

Let us now turn to the noisy setting, in which we observe the vector–matrix pair $(y, \bX) \in \mathbb{R}^n \times \mathbb{R}^{n \times d}$ linked by the observation model $y = \bX\theta^* + w$. The new ingredient here is the noise vector $w \in \mathbb{R}^n$.

In this case, we can again consider a hard-sparse estimator

\[\hat\theta_{\text{HS}}=\argmin_{\|\theta\|_0\leq k}\|y-\bX\theta\|_2^2\]
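For intuition about its cost, this estimator can be written as an exhaustive search over supports; a brute-force sketch (the helper name and toy dimensions are illustrative):

```python
import numpy as np
from itertools import combinations

def hard_sparse_ls(X, y, k):
    """Brute-force hard-sparse estimator: least squares over every
    support of size at most k. Cost grows like d^k, so this is
    only viable for tiny problems."""
    n, d = X.shape
    best_theta = np.zeros(d)
    best_rss = float(np.sum(y ** 2))  # baseline: empty support
    for size in range(1, k + 1):
        for S in combinations(range(d), size):
            cols = list(S)
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = float(np.sum((y - X[:, cols] @ coef) ** 2))
            if rss < best_rss:
                best_rss = rss
                best_theta = np.zeros(d)
                best_theta[cols] = coef
    return best_theta

# toy usage: d=8, k=2 means only C(8,1)+C(8,2)=36 supports to scan
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 8))
theta_star = np.zeros(8)
theta_star[[2, 5]] = [1.0, -2.0]
y = X @ theta_star + 0.01 * rng.normal(size=20)
theta_hat = hard_sparse_ls(X, y, k=2)
```

Already at moderate $d$ and $k$ the number of supports is astronomical, which is the computational obstruction discussed above.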

This estimator attains the fast rate, but computing it inherits the combinatorial cost of $\eqref{HS}$. Can we achieve the fast rate with a computationally efficient procedure?

One can consider the constrained Lasso estimator

\[\hat{\theta}=\argmin_{\|\theta\|_1\leq R}\|y-\bX\theta\|_2^2\]

with \(R=\|\theta^*\|_1\).
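The constrained Lasso differs from the more common penalized form, so off-the-shelf Lasso solvers do not apply directly; one simple approach is projected gradient descent onto the $\ell_1$-ball. The sketch below uses the sort-based projection of Duchi et al. (2008); the step size, iteration count, and toy data are illustrative choices, not from the text:

```python
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto {x : ||x||_1 <= R}
    (sort-and-threshold method of Duchi et al., 2008)."""
    if np.sum(np.abs(v)) <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u > (css - R) / idx)[0][-1]
    tau = (css[rho] - R) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def constrained_lasso(X, y, R, n_iter=500):
    """Projected gradient descent for
    min ||y - X theta||_2^2  subject to  ||theta||_1 <= R."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # the gradient below drops the
    theta = np.zeros(X.shape[1])            # factor 2, so this is 1/L overall
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        theta = project_l1_ball(theta - step * grad, R)
    return theta

# toy usage with R set to the oracle value ||theta*||_1
rng = np.random.default_rng(2)
n, d = 100, 20
theta_star = np.zeros(d)
theta_star[:3] = [1.5, -1.0, 0.5]
X = rng.normal(size=(n, d))
y = X @ theta_star + 0.1 * rng.normal(size=n)
theta_hat = constrained_lasso(X, y, R=np.sum(np.abs(theta_star)))
```

Every iterate is feasible by construction, which is exactly the property the analysis below exploits through the constraint $\|\hat\theta\|_1 \leq \|\theta^*\|_1$.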

In particular, for a constant $\alpha \geq 1$, let us define the set

\[\CC_{\alpha}(S) := \{\Delta \in \mathbb{R}^d \mid \|\Delta_{S^c}\|_1 \leq \alpha \|\Delta_S\|_1\}.\]

The matrix $\bX$ satisfies the restricted eigenvalue (RE) condition over $S$ with parameters $(\gamma, \alpha)$ if

\[\frac{1}{n} \|\bX\Delta\|_2^2 \geq \gamma \|\Delta\|_2^2 \quad \text{for all} \quad \Delta \in \CC_{\alpha}(S).\]

We show that there are many matrices $\bX\in\RR^{n\times d}$ that satisfy RE/RN conditions.

Finally, we show that if $\bX$ satisfies RE, then the constrained Lasso estimator can achieve the fast rate.

Theorem: If $\bX$ satisfies the RE condition over $S$ with parameters $(\gamma, 1)$, then, up to a universal constant,

\[\frac{\EE \|\bX(\hat\theta-\theta^*)\|_2^2}{n}\lesssim \frac{\sigma^2}{\gamma}\frac{k\log d}{n}.\]

The proof goes as follows. Since \(R=\|\theta^*\|_1\), the target $\theta^*$ is feasible for the constrained program, so the basic inequality \(\|y-\bX\hat\theta\|_2^2 \leq \|y-\bX\theta^*\|_2^2\) applies. Denoting \(\hat\Delta=\hat\theta-\theta^*\) and expanding both sides using $y = \bX\theta^* + w$, we obtain

\[\begin{align*} \frac{\|\bX \hat{\Delta}\|_2^2}{n} &\leq \frac{2}{n}\left\langle w,\bX\hat{\Delta}\right\rangle \\ &= \frac{2}{n}\left\langle \hat{\Delta},\bX^\top w\right\rangle \\ &\leq 2\|\hat{\Delta}\|_1\left\|\frac{\bX^\top w}{n}\right\|_\infty, \end{align*}\]

where the last step is Hölder's inequality.

Since \(\|\hat\theta\|_1\leq \|\theta^*\|_1\) and $\theta^*$ is supported on $S$, the triangle inequality gives \(\|\hat\Delta_{S^c}\|_1 \leq \|\hat\Delta_S\|_1\), i.e., the error belongs to the cone $\CC_1(S)$. Hence, by Cauchy–Schwarz on the $k$ coordinates in $S$,

\[\|\hat\Delta\|_1= \|\hat\Delta_S\|_1 + \|\hat\Delta_{S^c}\|_1\leq 2 \|\hat\Delta_S\|_1 \leq 2\sqrt{k}\|\hat\Delta\|_2\]
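Combining the last two displays with the RE condition (applied to $\hat\Delta \in \CC_1(S)$, so that $\|\hat\Delta\|_2 \leq \|\bX\hat\Delta\|_2/\sqrt{\gamma n}$) gives, up to universal constants,

\[\frac{\|\bX \hat{\Delta}\|_2^2}{n} \lesssim \sqrt{\frac{k}{\gamma}}\, \frac{\|\bX\hat\Delta\|_2}{\sqrt{n}} \left\|\frac{\bX^\top w}{n}\right\|_\infty, \quad\text{whence}\quad \frac{\|\bX \hat{\Delta}\|_2^2}{n} \lesssim \frac{k}{\gamma} \left\|\frac{\bX^\top w}{n}\right\|_\infty^2.\]

Finally, for noise with i.i.d. $\sigma$-sub-Gaussian entries and columns of $\bX$ normalized so that $\|\bX_j\|_2 \leq \sqrt{n}$ (a normalization assumed here), a standard maximal inequality yields \(\left\|\frac{\bX^\top w}{n}\right\|_\infty \lesssim \sigma\sqrt{\frac{\log d}{n}}\) with high probability, which gives the claimed rate.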