13  Singular Value Decomposition and Least Squares

13.1 Motivation

The spectral theorem diagonalizes self-adjoint operators via orthonormal eigenbases. For a general linear map T : \mathcal{V} \to \mathcal{W} between different spaces, or for non-self-adjoint operators, eigenvalues may not exist or may not suffice. Yet every such map admits a canonical decomposition using orthonormal bases in the domain and codomain.

The Singular Value Decomposition (SVD). Every linear map T : \mathcal{V} \to \mathcal{W} between finite-dimensional inner product spaces has orthonormal bases (v_1, \ldots, v_n) of \mathcal{V} and (u_1, \ldots, u_m) of \mathcal{W} such that T(v_i) = \sigma_i u_i, where \sigma_i \ge 0 are the singular values of T. The map is diagonal with respect to these bases.

This generalizes diagonalization: instead of one orthonormal basis diagonalizing T, we use two bases—one in the domain, one in the codomain. The singular values \sigma_i encode the scaling along principal directions.

SVD reveals the geometry of linear maps: it decomposes any map into a rotation (into the basis of the v_i), a scaling (by the \sigma_i), and another rotation (into the basis of the u_i). It provides the best low-rank approximations, pseudo-inverses for non-square or singular matrices, and the foundation for principal component analysis (PCA).

Throughout, \mathcal{V} and \mathcal{W} are finite-dimensional inner product spaces over \mathbb{R} or \mathbb{C} with \dim \mathcal{V} = n and \dim \mathcal{W} = m.


13.2 Singular Values and the SVD Theorem

Given T : \mathcal{V} \to \mathcal{W}, we relate T to a self-adjoint operator. The composition T^* T : \mathcal{V} \to \mathcal{V} satisfies \langle T^* T(v), w \rangle = \langle T(v), T(w) \rangle = \overline{\langle T(w), T(v) \rangle} = \overline{\langle T^* T(w), v \rangle} = \langle v, T^* T(w) \rangle, so T^* T is self-adjoint. (In the real case the conjugates are unnecessary, but the complex case requires them.) By Theorem 12.7, T^* T has an orthonormal eigenbasis with real eigenvalues.

The operator T^* T : \mathcal{V} \to \mathcal{V} is positive semi-definite: \langle T^* T(v), v \rangle \ge 0 for all v. Its eigenvalues are non-negative.

Proof. For any v \in \mathcal{V}, \langle T^* T(v), v \rangle = \langle T(v), T(v) \rangle = \|T(v)\|^2 \ge 0. If \lambda is an eigenvalue with eigenvector v \neq 0, then \lambda \|v\|^2 = \langle \lambda v, v \rangle = \langle T^* T(v), v \rangle = \|T(v)\|^2 \ge 0, so \lambda \ge 0. \square

Definition 13.1 (Singular values) Let T : \mathcal{V} \to \mathcal{W} be a linear map. The singular values of T are \sigma_i = \sqrt{\lambda_i}, where \lambda_1 \ge \cdots \ge \lambda_r > 0 are the positive eigenvalues of T^* T, counted with multiplicity; they inherit the ordering \sigma_1 \ge \cdots \ge \sigma_r > 0.

If \dim \mathcal{V} = n and T^* T has r positive eigenvalues, then T has r non-zero singular values. The rank of T equals r, since \operatorname{rank}(T) = \operatorname{rank}(T^* T).

Theorem 13.1 (Singular Value Decomposition) Let T : \mathcal{V} \to \mathcal{W} be a linear map with singular values \sigma_1 \ge \cdots \ge \sigma_r > 0. There exist orthonormal bases (v_1, \ldots, v_n) of \mathcal{V} and (u_1, \ldots, u_m) of \mathcal{W} such that \begin{align*} T(v_i) &= \sigma_i u_i \quad \text{for } i = 1, \ldots, r, \\ T(v_i) &= 0 \quad \text{for } i = r+1, \ldots, n. \end{align*}

Proof. Let \lambda_1 \ge \cdots \ge \lambda_r > 0 be the positive eigenvalues of T^* T, with orthonormal eigenvectors v_1, \ldots, v_r. Extend to an orthonormal basis (v_1, \ldots, v_n) of \mathcal{V} by adding an orthonormal basis of \ker(T^* T). Note \ker(T^* T) = \ker(T): if T^* T(v) = 0, then \|T(v)\|^2 = \langle T^* T(v), v \rangle = 0, so T(v) = 0, and the reverse inclusion is immediate.

For i \le r, let \sigma_i = \sqrt{\lambda_i} and define u_i = \frac{1}{\sigma_i} T(v_i). We verify (u_1, \ldots, u_r) is orthonormal. For i, j \le r, \begin{align*} \langle u_i, u_j \rangle &= \frac{1}{\sigma_i \sigma_j} \langle T(v_i), T(v_j) \rangle \\ &= \frac{1}{\sigma_i \sigma_j} \langle T^* T(v_i), v_j \rangle \\ &= \frac{1}{\sigma_i \sigma_j} \langle \lambda_i v_i, v_j \rangle \\ &= \frac{\lambda_i}{\sigma_i \sigma_j} \delta_{ij} = \delta_{ij}, \end{align*} using \lambda_i = \sigma_i^2. Thus (u_1, \ldots, u_r) is orthonormal.

By construction, T(v_i) = \sigma_i u_i for i \le r. For i > r, v_i \in \ker(T^* T) = \ker(T), so T(v_i) = 0.

Extend (u_1, \ldots, u_r) to an orthonormal basis (u_1, \ldots, u_m) of \mathcal{W}. This completes the SVD. \square

The SVD provides a canonical form for any linear map. In coordinates, if A is the matrix of T with respect to standard bases, then A = U \Sigma V^*, where U, V are orthogonal (or unitary) matrices and \Sigma is a diagonal matrix with entries \sigma_1, \ldots, \sigma_r followed by zeros.

Corollary 13.1 Every m \times n real matrix A satisfies A = U \Sigma V^T, where U \in \mathbb{R}^{m \times m} and V \in \mathbb{R}^{n \times n} are orthogonal matrices, and \Sigma \in \mathbb{R}^{m \times n} is diagonal, with the singular values \sigma_1 \ge \cdots \ge \sigma_r > 0 on the diagonal followed by zeros.

Similarly, every complex matrix A satisfies A = U \Sigma V^* with unitary U, V.

13.2.1 Example: SVD of a 2×2 Matrix

Consider A = \begin{pmatrix} 3 & 0 \\ 0 & -2 \end{pmatrix}. Then A^T A = \begin{pmatrix} 9 & 0 \\ 0 & 4 \end{pmatrix}, with eigenvalues \lambda_1 = 9, \lambda_2 = 4 and eigenvectors v_1 = (1, 0)^T, v_2 = (0, 1)^T. The singular values are \sigma_1 = 3, \sigma_2 = 2.

Compute u_1 = \frac{1}{3} A v_1 = (1, 0)^T, \quad u_2 = \frac{1}{2} A v_2 = (0, -1)^T. Thus A = U \Sigma V^T with U = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}, \quad V = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
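As a quick numerical check of this example (a NumPy sketch; note that the signs numpy chooses for the columns of U and V may differ from the hand computation, since singular vectors are only determined up to sign):

```python
import numpy as np

# The 2x2 example: A = diag(3, -2).
A = np.array([[3.0, 0.0],
              [0.0, -2.0]])

# numpy returns A = U @ diag(s) @ Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(A)

print(s)                                     # singular values [3. 2.]
print(np.allclose(U @ np.diag(s) @ Vt, A))   # True: the factorization reconstructs A
```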


13.3 Geometric Interpretation

The SVD reveals the action of T on the unit sphere. The unit sphere \{v : \|v\| = 1\} in \mathcal{V} is mapped by T to an ellipsoid in \mathcal{W} (a solid ellipsoid when T has a nontrivial kernel). The semi-axes lie along u_1, \ldots, u_r, with lengths \sigma_1, \ldots, \sigma_r.

To see this, write v = \sum_{i=1}^n \alpha_i v_i with \|v\|^2 = \sum \alpha_i^2 = 1. Then T(v) = \sum_{i=1}^r \sigma_i \alpha_i u_i lies in \operatorname{span}\{u_1, \ldots, u_r\} = \operatorname{range}(T) with coordinates (\sigma_1 \alpha_1, \ldots, \sigma_r \alpha_r) relative to (u_1, \ldots, u_r). When r = n, the constraint \sum \alpha_i^2 = 1 becomes \sum_{i=1}^r \frac{(\sigma_i \alpha_i)^2}{\sigma_i^2} = 1, the equation of an ellipsoid with semi-axes \sigma_i. When r < n, the coefficients \alpha_{r+1}, \ldots, \alpha_n can absorb part of the norm, so the image is the solid ellipsoid \sum_{i=1}^r \frac{(\sigma_i \alpha_i)^2}{\sigma_i^2} \le 1.

The largest singular value \sigma_1 is the operator norm: \|T\| = \sup_{\|v\| = 1} \|T(v)\| = \sigma_1. The direction of maximum stretching is v_1; the direction of minimum stretching (among non-zero stretches) is v_r.
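The identity \|T\| = \sigma_1 is easy to verify numerically (a sketch on a random matrix of my choosing; `np.linalg.norm(A, 2)` computes the spectral norm):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A)

# The operator (spectral) norm equals the largest singular value.
op_norm = np.linalg.norm(A, 2)
print(np.isclose(op_norm, s[0]))                     # True

# The first right singular vector v_1 attains the maximum stretch.
v1 = Vt[0]
print(np.isclose(np.linalg.norm(A @ v1), s[0]))      # True
```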


13.4 The Pseudo-Inverse

For non-square or singular matrices, no true inverse exists. The pseudo-inverse generalizes inversion to arbitrary matrices, providing the best approximate solution to Ax = b.

Definition 13.2 (Moore-Penrose pseudo-inverse) Let T : \mathcal{V} \to \mathcal{W} have SVD with singular values \sigma_1, \ldots, \sigma_r > 0 and orthonormal bases (v_i), (u_i). The pseudo-inverse T^\dagger : \mathcal{W} \to \mathcal{V} is defined by \begin{align*} T^\dagger(u_i) &= \frac{1}{\sigma_i} v_i \quad \text{for } i = 1, \ldots, r, \\ T^\dagger(u_i) &= 0 \quad \text{for } i = r+1, \ldots, m. \end{align*}

In matrix form, if A = U \Sigma V^T, then A^\dagger = V \Sigma^\dagger U^T, where \Sigma^\dagger has entries 1/\sigma_i on the diagonal for i \le r and zeros elsewhere.
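The matrix formula A^\dagger = V \Sigma^\dagger U^T can be implemented directly from the SVD and compared against NumPy's built-in pseudo-inverse (a sketch; the rank-deficient test matrix and the tolerance `tol` are choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
# A rank-deficient 4x3 matrix: the third column is the sum of the first two.
B = rng.standard_normal((4, 2))
A = np.column_stack([B, B.sum(axis=1)])

U, s, Vt = np.linalg.svd(A)
tol = 1e-10
r = int(np.sum(s > tol))                 # numerical rank

# Build Sigma^dagger: invert the nonzero singular values, zeros elsewhere.
Sdag = np.zeros((A.shape[1], A.shape[0]))
Sdag[:r, :r] = np.diag(1.0 / s[:r])

A_dag = Vt.T @ Sdag @ U.T
print(np.allclose(A_dag, np.linalg.pinv(A)))   # True: matches the built-in

# Two of the Moore-Penrose conditions:
print(np.allclose(A @ A_dag @ A, A))           # True
print(np.allclose(A_dag @ A @ A_dag, A_dag))   # True
```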

The pseudo-inverse satisfies:

1. T T^\dagger T = T
2. T^\dagger T T^\dagger = T^\dagger
3. (T T^\dagger)^* = T T^\dagger (self-adjoint)
4. (T^\dagger T)^* = T^\dagger T (self-adjoint)

These are the Moore-Penrose conditions characterizing T^\dagger uniquely.

Theorem 13.2 For b \in \mathcal{W}, the vector \hat{x} = T^\dagger(b) minimizes \|T(x) - b\| over all x \in \mathcal{V}. Among all minimizers, \hat{x} has the smallest norm.

Proof. Write b = \sum_{i=1}^m \beta_i u_i. Then T^\dagger(b) = \sum_{i=1}^r \frac{\beta_i}{\sigma_i} v_i. Compute T(T^\dagger(b)) = \sum_{i=1}^r \beta_i u_i = P_{\operatorname{range}(T)}(b), the orthogonal projection of b onto \operatorname{range}(T). By Theorem 11.8, this minimizes \|T(x) - b\|.

For uniqueness of minimal norm, suppose x also minimizes \|T(x) - b\|. Then T(x) = P_{\operatorname{range}(T)}(b), so x - T^\dagger(b) \in \ker(T) = \operatorname{span}\{v_{r+1}, \ldots, v_n\}. Write x = T^\dagger(b) + w with w \in \ker(T). Then \|x\|^2 = \|T^\dagger(b)\|^2 + \|w\|^2 \ge \|T^\dagger(b)\|^2, with equality iff w = 0. Thus \hat{x} = T^\dagger(b) has minimal norm. \square

The pseudo-inverse provides the least squares solution directly: for Ax = b, the solution is \hat{x} = A^\dagger b.
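Theorem 13.2 can be illustrated numerically (a sketch with a synthetic underdetermined system of my choosing): here A has full row rank, so A^\dagger b solves Ax = b exactly, and adding any kernel component only increases the norm.

```python
import numpy as np

rng = np.random.default_rng(2)
# Underdetermined: 2 equations, 4 unknowns, so infinitely many solutions.
A = rng.standard_normal((2, 4))
b = rng.standard_normal(2)

x_hat = np.linalg.pinv(A) @ b
print(np.allclose(A @ x_hat, b))        # True: an exact solution here

# Project an arbitrary vector onto ker(A); x_hat + w_ker is another solution.
w = rng.standard_normal(4)
w_ker = w - np.linalg.pinv(A) @ (A @ w)
print(np.allclose(A @ w_ker, 0))        # True: w_ker lies in the kernel

# x_hat is orthogonal to ker(A), so it has the smallest norm of all solutions.
print(np.linalg.norm(x_hat) <= np.linalg.norm(x_hat + w_ker) + 1e-12)  # True
```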


13.5 Low-Rank Approximations

Given a matrix A \in \mathbb{R}^{m \times n} with rank r, we seek the best rank-k approximation for k < r. “Best” means minimizing \|A - B\| over all matrices B of rank at most k, using the Frobenius norm \|A\|_F = \sqrt{\sum_{ij} a_{ij}^2} or the operator norm \|A\| = \sigma_1(A).

Theorem 13.3 (Eckart-Young-Mirsky theorem) Let A = U \Sigma V^T be the SVD with singular values \sigma_1 \ge \cdots \ge \sigma_r > 0. For k < r, define A_k = \sum_{i=1}^k \sigma_i u_i v_i^T. Then A_k minimizes \|A - B\| over all matrices B with \operatorname{rank}(B) \le k, and \|A - A_k\| = \sigma_{k+1}.

Proof. Write A = \sum_{i=1}^r \sigma_i u_i v_i^T. Then A_k has rank k, and A - A_k = \sum_{i=k+1}^r \sigma_i u_i v_i^T has singular values \sigma_{k+1}, \ldots, \sigma_r. Since the rank-one matrices u_i v_i^T are pairwise orthogonal in the Frobenius inner product, \|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2, and for the operator norm \|A - A_k\| = \sigma_{k+1}.

To show A_k is optimal, suppose \operatorname{rank}(B) \le k. Then \ker(B) has dimension at least n - k, and \operatorname{span}\{v_1, \ldots, v_{k+1}\} has dimension k+1. Since (n - k) + (k + 1) > n, the two subspaces intersect nontrivially, so there exists v \in \operatorname{span}\{v_1, \ldots, v_{k+1}\} \cap \ker(B) with \|v\| = 1. Write v = \sum_{i=1}^{k+1} \alpha_i v_i with \sum \alpha_i^2 = 1.

Then B(v) = 0 and \begin{align*} \|A - B\| &\ge \|(A - B)(v)\| = \|A(v)\| \\ &= \left\| \sum_{i=1}^{k+1} \sigma_i \alpha_i u_i \right\| \\ &= \sqrt{\sum_{i=1}^{k+1} \sigma_i^2 \alpha_i^2} \ge \sigma_{k+1}, \end{align*} using \sigma_i \ge \sigma_{k+1} for i \le k+1. Thus \|A - B\| \ge \sigma_{k+1} = \|A - A_k\|. \square

This result underlies data compression and dimensionality reduction: keeping the top k singular values captures the most important structure in A. The truncated SVD A_k is the best rank-k approximation in both operator and Frobenius norms.

13.5.1 Example: Image Compression

Represent a grayscale image as an m \times n matrix A with pixel intensities. Compute the SVD A = U \Sigma V^T. For k \ll \min(m, n), the approximation A_k = \sum_{i=1}^k \sigma_i u_i v_i^T requires storing only k(m + n + 1) numbers instead of mn.

If \sigma_{k+1} is small, A_k visually approximates A well. In practice a modest k (often a few dozen singular values) captures the main features of an image, achieving significant compression.
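The Eckart-Young-Mirsky error formulas can be checked directly (a sketch using a random matrix in place of actual image data):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 40))        # stand-in for an image matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD: best rank-k approximation

# Operator-norm error is exactly sigma_{k+1} (index k, zero-based) ...
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))   # True
# ... and Frobenius-norm error is the root-sum-square of the discarded values.
print(np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2))))  # True
```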


13.6 Principal Component Analysis

Given data points x_1, \ldots, x_N \in \mathbb{R}^n, arranged as columns of a matrix X \in \mathbb{R}^{n \times N}, we seek directions of maximum variance. Principal Component Analysis (PCA) identifies these directions using the SVD.

Assume the data is centered: \sum_{j=1}^N x_j = 0. The covariance matrix is C = \frac{1}{N} X X^T \in \mathbb{R}^{n \times n}. This is symmetric and positive semi-definite. The eigenvectors of C are the principal components; the eigenvalues measure variance along these directions.

Compute the SVD X = U \Sigma V^T. Then C = \frac{1}{N} U \Sigma^2 U^T, so U diagonalizes C. The columns u_1, \ldots, u_n of U are the principal components, ordered by decreasing eigenvalues \lambda_i = \sigma_i^2 / N.

The first principal component u_1 points in the direction of maximum variance. Projecting the data onto u_1 gives the one-dimensional representation with maximum preserved variance. For dimensionality reduction, project onto the top k principal components: \tilde{x}_j = \sum_{i=1}^k \langle x_j, u_i \rangle u_i. This minimizes the reconstruction error \sum \|x_j - \tilde{x}_j\|^2 among all k-dimensional projections.
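The PCA recipe above can be sketched in a few lines (NumPy, with synthetic data whose variance profile I chose so the ordering of components is visible):

```python
import numpy as np

rng = np.random.default_rng(4)
# 500 points in R^3, with standard deviations 5, 1, 0.2 along the axes.
N = 500
data = rng.standard_normal((3, N)) * np.array([[5.0], [1.0], [0.2]])

X = data - data.mean(axis=1, keepdims=True)   # center the data

U, s, Vt = np.linalg.svd(X, full_matrices=False)
variances = s**2 / N                          # eigenvalues of C = (1/N) X X^T

# U diagonalizes the covariance matrix, as in the text.
C = X @ X.T / N
print(np.allclose(U @ np.diag(variances) @ U.T, C))   # True

# Components come out ordered by decreasing variance.
print(variances[0] > variances[1] > variances[2])     # True

# Dimensionality reduction: project onto the top principal component.
x_tilde = U[:, :1] @ (U[:, :1].T @ X)
```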

PCA via SVD is numerically stable and efficient. It underpins applications in machine learning, signal processing, and scientific computing.


13.7 Least Squares Problems

The least squares problem seeks \hat{x} \in \mathbb{R}^n minimizing \|Ax - b\| for given A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m. When Ax = b has no exact solution (overdetermined system with m > n), we find the best approximation.

This problem pervades applied mathematics: fitting curves to data, parameter estimation, signal reconstruction. The solution connects orthogonal projections, the pseudo-inverse, and QR decomposition.

Theorem 13.4 (Least squares solution) The vector \hat{x} \in \mathbb{R}^n minimizes \|Ax - b\| if and only if A^T A \hat{x} = A^T b. If A has full column rank, the solution is unique: \hat{x} = (A^T A)^{-1} A^T b.

Proof. By the projection theorem (Theorem 11.8 in Chapter 11), \hat{x} minimizes \|Ax - b\| if and only if A\hat{x} is the orthogonal projection of b onto \operatorname{Col}(A), i.e. b - A\hat{x} \perp \operatorname{Col}(A).

This orthogonality condition states: for all y \in \mathbb{R}^n, \langle Ay, b - A\hat{x} \rangle = 0. Equivalently, A^T(b - A\hat{x}) = 0, yielding the normal equations A^T A \hat{x} = A^T b.

If A has full column rank, \ker(A) = \{0\}. Then \ker(A^T A) = \{0\} as well: A^T A x = 0 implies \|Ax\|^2 = x^T A^T A x = 0, so x \in \ker(A) = \{0\}. Thus A^T A is invertible and \hat{x} = (A^T A)^{-1} A^T b is the unique solution. \square

The matrix (A^T A)^{-1} A^T is the left pseudo-inverse of A when A has full column rank. It generalizes matrix inversion to rectangular matrices.

13.7.1 Example: Linear Regression

Given data points (t_1, y_1), \ldots, (t_m, y_m), fit a line y = c_0 + c_1 t. Minimize \sum_{i=1}^m (y_i - c_0 - c_1 t_i)^2 = \|Ax - b\|^2, where A = \begin{pmatrix} 1 & t_1 \\ \vdots & \vdots \\ 1 & t_m \end{pmatrix}, \quad x = \begin{pmatrix} c_0 \\ c_1 \end{pmatrix}, \quad b = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}.

The normal equations A^T A x = A^T b yield \begin{align*} A^T A &= \begin{pmatrix} m & \sum t_i \\ \sum t_i & \sum t_i^2 \end{pmatrix}, \quad A^T b = \begin{pmatrix} \sum y_i \\ \sum t_i y_i \end{pmatrix}. \end{align*} Solving gives the least squares regression line.
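The regression example translates directly to code (a sketch with synthetic noisy samples of the line y = 1 + 2t; the normal-equations solution is compared against NumPy's least squares routine):

```python
import numpy as np

# Noisy samples of y = 1 + 2t.
t = np.linspace(0, 1, 20)
rng = np.random.default_rng(5)
y = 1.0 + 2.0 * t + 0.01 * rng.standard_normal(t.size)

# Design matrix with columns (1, t_i), as in the text.
A = np.column_stack([np.ones_like(t), t])

# Solve the normal equations A^T A x = A^T b ...
x_ne = np.linalg.solve(A.T @ A, A.T @ y)

# ... which agrees with numpy's least squares solver.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x_ne, x_ls))   # True
print(x_ne)                      # approximately [1, 2]
```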


13.8 QR Method for Least Squares

Computing (A^T A)^{-1} is numerically unstable when A^T A is ill-conditioned (near-singular). The QR decomposition provides a more stable algorithm.

Compute A = QR where Q \in \mathbb{R}^{m \times m} is orthogonal and R \in \mathbb{R}^{m \times n} is upper triangular. Partition R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix} with R_1 \in \mathbb{R}^{n \times n} upper triangular. Then \|Ax - b\|^2 = \|QRx - b\|^2 = \|Rx - Q^T b\|^2, using orthogonality \|Qv\| = \|v\|. Write Q^T b = \begin{pmatrix} c \\ d \end{pmatrix} with c \in \mathbb{R}^n, d \in \mathbb{R}^{m-n}. Then \|Rx - Q^T b\|^2 = \|R_1 x - c\|^2 + \|d\|^2. Minimizing over x requires R_1 x = c, solved by back-substitution since R_1 is upper triangular. The minimum residual is \|d\|.

Corollary 13.2 If A = QR with Q orthogonal and R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix} with R_1 upper triangular and invertible, then the least squares solution is \hat{x} = R_1^{-1} c, where c consists of the first n components of Q^T b.

This method avoids forming A^T A, improving numerical stability. The condition number of R_1 equals that of A, whereas the condition number of A^T A is the square of A’s condition number, amplifying errors.
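The QR route can be sketched as follows (NumPy's reduced QR returns Q_1 and R_1 directly; here `np.linalg.solve` stands in for a dedicated back-substitution routine):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)

# Reduced QR: Q1 (30x4) has orthonormal columns, R1 (4x4) is upper triangular.
Q1, R1 = np.linalg.qr(A)

c = Q1.T @ b                       # first n components of Q^T b
x_qr = np.linalg.solve(R1, c)      # R1 x = c; back-substitution in practice

# Matches the least squares solution computed directly.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_ls))     # True
```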


13.9 Connection to Pseudo-Inverse and SVD

The pseudo-inverse (Section 13.4) provides another perspective on least squares. By Theorem 13.2, \hat{x} = A^\dagger b minimizes \|Ax - b\| with minimal norm among all minimizers.

Using the SVD A = U \Sigma V^T, we have A^\dagger = V \Sigma^\dagger U^T, so \hat{x} = V \Sigma^\dagger U^T b = \sum_{i=1}^r \frac{\langle b, u_i \rangle}{\sigma_i} v_i. This expresses the solution as a sum over singular vectors, weighted by reciprocal singular values. Small singular values amplify errors in b, indicating ill-conditioning.

The SVD also enables regularized least squares (Tikhonov regularization): minimize \|Ax - b\|^2 + \lambda \|x\|^2 for \lambda > 0. This trades accuracy for stability, preventing large \|x\| when A is ill-conditioned. The solution is \hat{x}_\lambda = (A^T A + \lambda I)^{-1} A^T b = \sum_{i=1}^r \frac{\sigma_i}{\sigma_i^2 + \lambda} \langle b, u_i \rangle v_i. The factor \sigma_i / (\sigma_i^2 + \lambda) damps contributions from small singular values, stabilizing the solution.
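The equivalence of the two Tikhonov formulas is easy to confirm numerically (a sketch; the matrix, right-hand side, and regularization parameter are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1

# Regularized normal equations: (A^T A + lam I) x = A^T b ...
x_direct = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

# ... match the SVD filter-factor formula from the text.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
filt = s / (s**2 + lam)                   # damping factors sigma_i / (sigma_i^2 + lam)
x_svd = Vt.T @ (filt * (U.T @ b))
print(np.allclose(x_direct, x_svd))       # True
```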


13.10 Applications to Curve Fitting

Least squares extends to polynomial and nonlinear fitting. Given data (t_1, y_1), \ldots, (t_m, y_m), fit a polynomial p(t) = c_0 + c_1 t + \cdots + c_n t^n of degree n < m. Minimize \sum_{i=1}^m (y_i - p(t_i))^2 = \|Ax - b\|^2, where A = \begin{pmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & t_m & t_m^2 & \cdots & t_m^n \end{pmatrix}, \quad x = \begin{pmatrix} c_0 \\ \vdots \\ c_n \end{pmatrix}. This is the Vandermonde matrix. The normal equations yield the polynomial coefficients.
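A minimal polynomial-fit sketch (NumPy builds the Vandermonde matrix via `np.vander`; the data here are exact samples of a quadratic, so least squares recovers the coefficients exactly):

```python
import numpy as np

# Exact samples of p(t) = 1 - t + 2t^2.
t = np.linspace(-1, 1, 15)
y = 1.0 - t + 2.0 * t**2

# Vandermonde matrix with columns 1, t, t^2.
A = np.vander(t, 3, increasing=True)

coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 6))   # [ 1. -1.  2.]
```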

For general basis functions \phi_1(t), \ldots, \phi_n(t), fit f(t) = \sum c_j \phi_j(t) by setting A_{ij} = \phi_j(t_i). Orthogonal bases \{\phi_j\} simplify A^T A, accelerating computation.

13.10.1 Example: Exponential Decay

Fit y = c e^{-\lambda t} to data by linearizing: take logarithms \ln y = \ln c - \lambda t. This reduces to linear regression in (\ln c, -\lambda). However, errors in \ln y are not Gaussian if errors in y are, violating least squares assumptions. Nonlinear least squares (minimizing \sum (y_i - c e^{-\lambda t_i})^2 over c, \lambda) is more principled but requires iterative methods.

13.10.2 Example: Trigonometric Interpolation

Given data points (t_1, y_1), \ldots, (t_{2n+1}, y_{2n+1}) with t_i = 2\pi i / (2n+1), fit a trigonometric polynomial f(t) = a_0 + \sum_{k=1}^n (a_k \cos(kt) + b_k \sin(kt)). The design matrix has columns 1, \cos(t_i), \sin(t_i), \cos(2t_i), \sin(2t_i), \ldots These form an orthogonal system under summation \sum_{i=1}^{2n+1}, so A^T A is diagonal. The solution reduces to a_0 = \frac{1}{2n+1} \sum_{i=1}^{2n+1} y_i and, for k \ge 1, a_k = \frac{2}{2n+1} \sum_{i=1}^{2n+1} y_i \cos(kt_i), \quad b_k = \frac{2}{2n+1} \sum_{i=1}^{2n+1} y_i \sin(kt_i). These are discrete Fourier coefficients, computed efficiently via the Fast Fourier Transform (FFT).
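These coefficient formulas can be checked by sampling a known trigonometric polynomial and recovering its coefficients (a NumPy sketch; the sample polynomial is an arbitrary choice of mine):

```python
import numpy as np

n = 3
N = 2 * n + 1
i = np.arange(1, N + 1)
t = 2 * np.pi * i / N                     # equally spaced sample points

# Sample a known trigonometric polynomial: 0.5 + 2 cos(t) - 1.5 sin(2t).
y = 0.5 + 2.0 * np.cos(t) - 1.5 * np.sin(2 * t)

# Recover the coefficients from the discrete orthogonality relations.
a0 = np.sum(y) / N
a = np.array([(2.0 / N) * np.sum(y * np.cos(k * t)) for k in range(1, n + 1)])
b = np.array([(2.0 / N) * np.sum(y * np.sin(k * t)) for k in range(1, n + 1)])

print(np.round(a0, 6))   # 0.5
print(np.round(a, 6))    # [2, 0, 0]
print(np.round(b, 6))    # [0, -1.5, 0]
```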

13.10.3 Example: Spline Interpolation

For smooth curve fitting, use cubic splines. Given knots t_1 < \cdots < t_m, construct piecewise cubic polynomials s_i(t) on [t_i, t_{i+1}] satisfying continuity of s, s', s'' at the knots and boundary conditions. For interpolation these conditions determine the coefficients through a banded linear system; for smoothing with more data points than knots, the coefficients solve a least squares problem with equality constraints, handled via Lagrange multipliers or direct elimination.


13.11 Weighted Least Squares

When measurements have different precisions, assign weights. Minimize \sum_{i=1}^m w_i (y_i - (Ax)_i)^2 = \|W^{1/2}(Ax - b)\|^2, where W = \operatorname{diag}(w_1, \ldots, w_m) with w_i > 0. This reduces to standard least squares for \tilde{A} = W^{1/2} A and \tilde{b} = W^{1/2} b: \hat{x} = (\tilde{A}^T \tilde{A})^{-1} \tilde{A}^T \tilde{b} = (A^T W A)^{-1} A^T W b.

Weights w_i = 1/\sigma_i^2 give maximum likelihood estimation when errors in y_i are independent Gaussian with variance \sigma_i^2.
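The reduction to ordinary least squares can be verified numerically (a sketch with random data and weights; rescaling rows by \sqrt{w_i} reproduces the weighted normal equations):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
w = rng.uniform(0.5, 2.0, size=10)   # positive weights

# Rescale rows by sqrt(w_i) and solve ordinary least squares ...
sw = np.sqrt(w)
x_scaled, *_ = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)

# ... which matches the weighted normal equations (A^T W A) x = A^T W b.
W = np.diag(w)
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
print(np.allclose(x_scaled, x_wls))   # True
```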

13.11.1 Example: Sensor Fusion

Consider k sensors measuring a parameter x \in \mathbb{R}^n with measurements y_i = H_i x + \epsilon_i where \epsilon_i \sim \mathcal{N}(0, \Sigma_i). Stack measurements: y = Hx + \epsilon with H = \begin{pmatrix} H_1 \\ \vdots \\ H_k \end{pmatrix} and covariance \Sigma = \operatorname{diag}(\Sigma_1, \ldots, \Sigma_k). The MLE is weighted least squares with W = \Sigma^{-1}: \hat{x} = (H^T \Sigma^{-1} H)^{-1} H^T \Sigma^{-1} y. This optimally combines information from multiple sensors, weighting by precision.


13.12 Recursive Least Squares

In online settings, data arrives sequentially. Recursive least squares (RLS) updates the solution incrementally without recomputing from scratch.

Suppose we have computed \hat{x}_k from the first k data points. When (t_{k+1}, y_{k+1}) arrives, update \hat{x}_{k+1} using the Sherman-Morrison formula: (A_{k+1}^T A_{k+1})^{-1} = P_k - \frac{P_k a_{k+1} a_{k+1}^T P_k}{1 + a_{k+1}^T P_k a_{k+1}}, where a_{k+1} is the new row appended to A_k and P_k = (A_k^T A_k)^{-1}. This O(n^2) update avoids the O(n^3) cost of re-solving.

13.12.1 Example: Adaptive Noise Cancellation

In signal processing, a desired signal s(t) is corrupted by noise n(t): observation y(t) = s(t) + n(t). A reference sensor provides correlated noise r(t). Model n(t) \approx \sum_{k=0}^{p-1} w_k r(t-k) with filter coefficients w \in \mathbb{R}^p. As data arrives, RLS adaptively updates w to minimize \sum (y(t) - \sum w_k r(t-k))^2, canceling noise from y(t) to recover s(t).

13.12.2 Example: System Identification

An unknown linear system with impulse response h[n] is excited by input x[n], producing output y[n] = \sum_{k=0}^{N-1} h[k] x[n-k] + \epsilon[n]. Collect observations (x, y) and solve y \approx Xh where X is a Toeplitz matrix of delayed inputs. Least squares estimates h, identifying the system. RLS tracks time-varying systems where h changes slowly.

RLS is central to adaptive filtering, system identification, and real-time signal processing.
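A minimal RLS sketch using the Sherman-Morrison update above (NumPy, with synthetic regression data of my choosing; the recursion is initialized from the first n rows so that A^T A is invertible):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 3
rows = rng.standard_normal((50, n))
ys = rows @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(50)

# Initialize P = (A^T A)^{-1} and x from the first n data points.
A0, b0 = rows[:n], ys[:n]
P = np.linalg.inv(A0.T @ A0)
x = P @ A0.T @ b0

# Process the remaining rows one at a time.
for a, y in zip(rows[n:], ys[n:]):
    Pa = P @ a
    P = P - np.outer(Pa, Pa) / (1.0 + a @ Pa)   # O(n^2) Sherman-Morrison update
    x = x + P @ a * (y - a @ x)                 # gain times prediction error

# The recursive solution matches batch least squares on all the data.
x_batch, *_ = np.linalg.lstsq(rows, ys, rcond=None)
print(np.allclose(x, x_batch))   # True
```

The correction term P @ a * (y - a @ x) is the standard RLS gain-times-innovation form; algebraically it reproduces the batch normal-equations solution at every step.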


13.13 Closing Remarks

The singular value decomposition extends diagonalization to arbitrary linear maps between different spaces. Every map T : \mathcal{V} \to \mathcal{W} decomposes as T(v) = \sum \sigma_i \langle v, v_i \rangle u_i, revealing intrinsic scaling factors \sigma_i independent of basis choice.

SVD reduces to the spectral theorem: singular values of T are square roots of eigenvalues of T^* T. The pseudo-inverse T^\dagger generalizes inversion, providing least squares solutions with minimal norm. The Eckart-Young theorem shows truncated SVD gives optimal low-rank approximations, fundamental to data compression. PCA exploits SVD to extract principal variance directions, enabling dimensionality reduction.

Least squares problems—minimizing \|Ax - b\|—arise ubiquitously in data fitting, parameter estimation, and inverse problems. The normal equations A^T A \hat{x} = A^T b characterize solutions, but the QR method provides better numerical stability. The SVD offers a third perspective, connecting least squares to pseudo-inverses and revealing the role of singular values in conditioning.

These tools permeate numerical linear algebra, statistics, and applied mathematics. SVD is algorithmic, stable, and universal—applicable to any matrix. Least squares is the workhorse of data analysis, from linear regression to machine learning. Together, they form the computational backbone of modern scientific computing.