13 Singular Value Decomposition and Least Squares
13.1 Motivation
The spectral theorem diagonalizes self-adjoint operators via orthonormal eigenbases. For a general linear map T : \mathcal{V} \to \mathcal{W} between different spaces, or for non-self-adjoint operators, eigenvalues may not exist or may not suffice. Yet every such map admits a canonical decomposition using orthonormal bases in the domain and codomain.
The Singular Value Decomposition (SVD). Every linear map T : \mathcal{V} \to \mathcal{W} between finite-dimensional inner product spaces has orthonormal bases (v_1, \ldots, v_n) of \mathcal{V} and (u_1, \ldots, u_m) of \mathcal{W} such that T(v_i) = \sigma_i u_i, where \sigma_i \ge 0 are the singular values of T. The map is diagonal with respect to these bases.
This generalizes diagonalization: instead of one orthonormal basis diagonalizing T, we use two bases—one in the domain, one in the codomain. The singular values \sigma_i encode the scaling along principal directions.
SVD reveals the geometry of linear maps: it decomposes any map into a rotation (by v_i), scaling (by \sigma_i), and another rotation (by u_i). It provides the best low-rank approximations, pseudo-inverses for non-square or singular matrices, and the foundation for principal component analysis (PCA).
Throughout, \mathcal{V} and \mathcal{W} are finite-dimensional inner product spaces over \mathbb{R} or \mathbb{C} with \dim \mathcal{V} = n and \dim \mathcal{W} = m.
13.2 Singular Values and the SVD Theorem
Given T : \mathcal{V} \to \mathcal{W}, we relate T to a self-adjoint operator. The composition T^* T : \mathcal{V} \to \mathcal{V} satisfies \langle T^* T(v), w \rangle = \langle T(v), T(w) \rangle = \overline{\langle T(w), T(v) \rangle} = \overline{\langle T^* T(w), v \rangle} = \langle v, T^* T(w) \rangle, which is precisely the self-adjointness condition, so T^* T is self-adjoint. (In the real case the conjugates are unnecessary, but the complex case requires them.) By Theorem 12.7, T^* T has an orthonormal eigenbasis with real eigenvalues.
The operator T^* T : \mathcal{V} \to \mathcal{V} is positive semi-definite: \langle T^* T(v), v \rangle \ge 0 for all v. Its eigenvalues are non-negative.
Proof. For any v \in \mathcal{V}, \langle T^* T(v), v \rangle = \langle T(v), T(v) \rangle = \|T(v)\|^2 \ge 0. If \lambda is an eigenvalue with eigenvector v \neq 0, then \lambda \|v\|^2 = \langle \lambda v, v \rangle = \langle T^* T(v), v \rangle = \|T(v)\|^2 \ge 0, so \lambda \ge 0. \square
Definition 13.1 (Singular values) Let T : \mathcal{V} \to \mathcal{W} be a linear map. The singular values of T are \sigma_1, \ldots, \sigma_r, where \sigma_i = \sqrt{\lambda_i} and \lambda_1 \ge \cdots \ge \lambda_r > 0 are the positive eigenvalues of T^* T. By convention, order them \sigma_1 \ge \cdots \ge \sigma_r > 0.
If \dim \mathcal{V} = n and T^* T has r positive eigenvalues, then T has r non-zero singular values. The rank of T equals r, since \operatorname{rank}(T) = \operatorname{rank}(T^* T).
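As a quick numerical check (a sketch using NumPy; the matrix A below is an arbitrary example), the singular values returned by a standard SVD routine should match the square roots of the positive eigenvalues of A^T A:

```python
import numpy as np

# Arbitrary example matrix (any real m x n matrix works here).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0]])

# Singular values of A, in decreasing order.
sigma = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of A^T A, sorted decreasingly; keep only the positive ones.
lam = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
lam_pos = lam[lam > 1e-12]

# sigma_i = sqrt(lambda_i) for the positive eigenvalues.
print(np.allclose(sigma[:len(lam_pos)], np.sqrt(lam_pos)))  # True
```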
Theorem 13.1 (Singular Value Decomposition) Let T : \mathcal{V} \to \mathcal{W} be a linear map with singular values \sigma_1 \ge \cdots \ge \sigma_r > 0. There exist orthonormal bases (v_1, \ldots, v_n) of \mathcal{V} and (u_1, \ldots, u_m) of \mathcal{W} such that \begin{align*} T(v_i) &= \sigma_i u_i \quad \text{for } i = 1, \ldots, r, \\ T(v_i) &= 0 \quad \text{for } i = r+1, \ldots, n. \end{align*}
Proof. Let \lambda_1 \ge \cdots \ge \lambda_r > 0 be the positive eigenvalues of T^* T, with orthonormal eigenvectors v_1, \ldots, v_r. Extend to an orthonormal basis (v_1, \ldots, v_n) of \mathcal{V} by adding an orthonormal basis of \ker(T^* T) = \ker(T).
For i \le r, let \sigma_i = \sqrt{\lambda_i} and define u_i = \frac{1}{\sigma_i} T(v_i). We verify (u_1, \ldots, u_r) is orthonormal. For i, j \le r, \begin{align*} \langle u_i, u_j \rangle &= \frac{1}{\sigma_i \sigma_j} \langle T(v_i), T(v_j) \rangle \\ &= \frac{1}{\sigma_i \sigma_j} \langle T^* T(v_i), v_j \rangle \\ &= \frac{1}{\sigma_i \sigma_j} \langle \lambda_i v_i, v_j \rangle \\ &= \frac{\lambda_i}{\sigma_i \sigma_j} \delta_{ij} = \delta_{ij}, \end{align*} using \lambda_i = \sigma_i^2. Thus (u_1, \ldots, u_r) is orthonormal.
By construction, T(v_i) = \sigma_i u_i for i \le r. For i > r, v_i \in \ker(T^* T) = \ker(T), so T(v_i) = 0.
Extend (u_1, \ldots, u_r) to an orthonormal basis (u_1, \ldots, u_m) of \mathcal{W}. This completes the SVD. \square
The SVD provides a canonical form for any linear map. In coordinates, if A is the matrix of T with respect to standard bases, then A = U \Sigma V^*, where U, V are orthogonal (or unitary) matrices and \Sigma is a diagonal matrix with entries \sigma_1, \ldots, \sigma_r followed by zeros.
Corollary 13.1 Every m \times n real matrix A satisfies A = U \Sigma V^T, where U \in \mathbb{R}^{m \times m} and V \in \mathbb{R}^{n \times n} are orthogonal matrices, and \Sigma \in \mathbb{R}^{m \times n} has the entries \sigma_1 \ge \cdots \ge \sigma_r > 0 on its diagonal, followed by zeros, and zeros elsewhere.
Similarly, every complex matrix A satisfies A = U \Sigma V^* with unitary U, V.
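A minimal NumPy sketch of Corollary 13.1 (the 3 × 2 example matrix is arbitrary): np.linalg.svd with full_matrices=True returns square orthogonal U and V and the singular values, from which A is reconstructed exactly.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # an arbitrary 3 x 2 example
m, n = A.shape

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: m x m, Vt: n x n

# Build the m x n diagonal Sigma from the singular values.
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vt))             # True: A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(m)),            # U orthogonal
      np.allclose(Vt @ Vt.T, np.eye(n)))          # V orthogonal
```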
13.2.1 Example: SVD of a 2×2 Matrix
Consider A = \begin{pmatrix} 3 & 0 \\ 0 & -2 \end{pmatrix}. Then A^T A = \begin{pmatrix} 9 & 0 \\ 0 & 4 \end{pmatrix}, with eigenvalues \lambda_1 = 9, \lambda_2 = 4 and eigenvectors v_1 = (1, 0)^T, v_2 = (0, 1)^T. The singular values are \sigma_1 = 3, \sigma_2 = 2.
Compute u_1 = \frac{1}{3} A v_1 = (1, 0)^T, \quad u_2 = \frac{1}{2} A v_2 = (0, -1)^T. Thus A = U \Sigma V^T with U = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}, \quad V = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
13.3 Geometric Interpretation
The SVD reveals the action of T on unit spheres. The unit sphere \{v : \|v\| = 1\} in \mathcal{V} is mapped by T to an ellipsoid in \mathcal{W}. The axes of this ellipsoid are \sigma_i u_i, with semi-axis lengths \sigma_i.
To see this, write v = \sum_{i=1}^n \alpha_i v_i with \|v\|^2 = \sum \alpha_i^2 = 1. Then T(v) = \sum_{i=1}^r \sigma_i \alpha_i u_i. The image T(v) lies in \operatorname{span}\{u_1, \ldots, u_r\} = \operatorname{range}(T) with coordinates (\sigma_1 \alpha_1, \ldots, \sigma_r \alpha_r) relative to (u_1, \ldots, u_r). When r = n, the constraint \sum \alpha_i^2 = 1 becomes \sum_{i=1}^r \frac{(\sigma_i \alpha_i)^2}{\sigma_i^2} = 1, the equation of an ellipsoid with semi-axes \sigma_i; when r < n, any weight on \alpha_{r+1}, \ldots, \alpha_n loosens the constraint to \sum_{i=1}^r \frac{(\sigma_i \alpha_i)^2}{\sigma_i^2} \le 1, so the image fills the corresponding solid ellipsoid.
The largest singular value \sigma_1 is the operator norm: \|T\| = \sup_{\|v\| = 1} \|T(v)\| = \sigma_1. The direction of maximum stretching is v_1; the direction of minimum stretching (among non-zero stretches) is v_r.
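The identity \|T\| = \sigma_1 can be checked numerically (a sketch; the example matrix and the random sampling of unit vectors are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))                      # arbitrary example matrix

sigma1 = np.linalg.svd(A, compute_uv=False)[0]   # largest singular value

# np.linalg.norm(A, 2) computes the operator (spectral) norm directly.
print(np.isclose(np.linalg.norm(A, 2), sigma1))  # True

# Random unit vectors never stretch by more than sigma1.
v = rng.normal(size=(3, 1000))
v /= np.linalg.norm(v, axis=0)
stretches = np.linalg.norm(A @ v, axis=0)
print(stretches.max() <= sigma1 + 1e-12)         # True
```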

13.4 The Pseudo-Inverse
For non-square or singular matrices, no true inverse exists. The pseudo-inverse generalizes inversion to arbitrary matrices, providing the best approximate solution to Ax = b.
Definition 13.2 (Moore-Penrose pseudo-inverse) Let T : \mathcal{V} \to \mathcal{W} have SVD with singular values \sigma_1, \ldots, \sigma_r > 0 and orthonormal bases (v_i), (u_i). The pseudo-inverse T^\dagger : \mathcal{W} \to \mathcal{V} is defined by \begin{align*} T^\dagger(u_i) &= \frac{1}{\sigma_i} v_i \quad \text{for } i = 1, \ldots, r, \\ T^\dagger(u_i) &= 0 \quad \text{for } i = r+1, \ldots, m. \end{align*}
In matrix form, if A = U \Sigma V^T, then A^\dagger = V \Sigma^\dagger U^T, where \Sigma^\dagger has entries 1/\sigma_i on the diagonal for i \le r and zeros elsewhere.
The pseudo-inverse satisfies:
1. T T^\dagger T = T
2. T^\dagger T T^\dagger = T^\dagger
3. (T T^\dagger)^* = T T^\dagger (self-adjoint)
4. (T^\dagger T)^* = T^\dagger T (self-adjoint)
These are the Moore-Penrose conditions characterizing T^\dagger uniquely.
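The four conditions can be verified numerically with np.linalg.pinv (a sketch; the rank-deficient example matrix is an assumption chosen so the check is non-trivial):

```python
import numpy as np

# A rank-deficient 3 x 4 example: row 3 = row 1 + row 2.
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 3.0, 1.0, 1.0]])

Ap = np.linalg.pinv(A)   # Moore-Penrose pseudo-inverse, computed via the SVD

print(np.allclose(A @ Ap @ A, A))            # 1. A A^+ A = A
print(np.allclose(Ap @ A @ Ap, Ap))          # 2. A^+ A A^+ = A^+
print(np.allclose((A @ Ap).T, A @ Ap))       # 3. A A^+ symmetric
print(np.allclose((Ap @ A).T, Ap @ A))       # 4. A^+ A symmetric
```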
Theorem 13.2 For b \in \mathcal{W}, the vector \hat{x} = T^\dagger(b) minimizes \|T(x) - b\| over all x \in \mathcal{V}. Among all minimizers, \hat{x} has the smallest norm.
Proof. Write b = \sum_{i=1}^m \beta_i u_i. Then T^\dagger(b) = \sum_{i=1}^r \frac{\beta_i}{\sigma_i} v_i. Compute T(T^\dagger(b)) = \sum_{i=1}^r \beta_i u_i = P_{\operatorname{range}(T)}(b), the orthogonal projection of b onto \operatorname{range}(T). By Theorem 11.8, this minimizes \|T(x) - b\|.
For uniqueness of minimal norm, suppose x also minimizes \|T(x) - b\|. Then T(x) = P_{\operatorname{range}(T)}(b), so x - T^\dagger(b) \in \ker(T) = \operatorname{span}\{v_{r+1}, \ldots, v_n\}. Write x = T^\dagger(b) + w with w \in \ker(T). Since T^\dagger(b) \in \operatorname{span}\{v_1, \ldots, v_r\} is orthogonal to \ker(T), the Pythagorean theorem gives \|x\|^2 = \|T^\dagger(b)\|^2 + \|w\|^2 \ge \|T^\dagger(b)\|^2, with equality iff w = 0. Thus \hat{x} = T^\dagger(b) has minimal norm. \square
The pseudo-inverse provides the least squares solution directly: for Ax = b, the solution is \hat{x} = A^\dagger b.
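Theorem 13.2 can be illustrated numerically (a sketch; the rank-deficient system below is an assumed example): np.linalg.lstsq also returns the minimum-norm least squares solution, so it should agree with A^\dagger b.

```python
import numpy as np

# Rank-deficient system: column 3 = column 1 + column 2, so the minimizers
# of ||Ax - b|| form an affine line; the pseudo-inverse picks the
# minimum-norm one.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.0, 0.0, 1.0])

x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_pinv, x_lstsq))   # True: both are the min-norm solution
```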
13.5 Low-Rank Approximations
Given a matrix A \in \mathbb{R}^{m \times n} with rank r, we seek the best rank-k approximation for k < r. “Best” means minimizing \|A - B\| over all matrices B of rank at most k, using the Frobenius norm \|A\|_F = \sqrt{\sum_{ij} a_{ij}^2} or the operator norm \|A\| = \sigma_1(A).
Theorem 13.3 (Eckart-Young-Mirsky theorem) Let A = U \Sigma V^T be the SVD with singular values \sigma_1 \ge \cdots \ge \sigma_r > 0. For k < r, define A_k = \sum_{i=1}^k \sigma_i u_i v_i^T. Then A_k minimizes \|A - B\| over all matrices B with \operatorname{rank}(B) \le k, and \|A - A_k\| = \sigma_{k+1}. (Here \|\cdot\| denotes the operator norm, which is what the proof below treats; the analogous statement holds in the Frobenius norm as well.)
Proof. Write A = \sum_{i=1}^r \sigma_i u_i v_i^T. Then A_k has rank k, and A - A_k = \sum_{i=k+1}^r \sigma_i u_i v_i^T. The singular values of A - A_k are \sigma_{k+1}, \ldots, \sigma_r: the sets \{v_{k+1},\ldots,v_r\} and \{u_{k+1},\ldots,u_r\} are orthonormal and the terms in the sum have pairwise orthogonal row and column spaces. The operator norm equals the largest singular value, so \|A - A_k\| = \sigma_{k+1}.
To show A_k is optimal, suppose \operatorname{rank}(B) \le k. Then \ker(B) has dimension at least n - k, while \operatorname{span}\{v_1, \ldots, v_{k+1}\} has dimension k+1; since (n - k) + (k + 1) > n, the two subspaces intersect non-trivially, so there exists v \in \operatorname{span}\{v_1, \ldots, v_{k+1}\} \cap \ker(B) with \|v\| = 1. Write v = \sum_{i=1}^{k+1} \alpha_i v_i with \sum \alpha_i^2 = 1.
Then B(v) = 0 and \begin{align*} \|A - B\| &\ge \|(A - B)(v)\| = \|A(v)\| \\ &= \left\| \sum_{i=1}^{k+1} \sigma_i \alpha_i u_i \right\| \\ &= \sqrt{\sum_{i=1}^{k+1} \sigma_i^2 \alpha_i^2} \ge \sigma_{k+1}, \end{align*} using \sigma_i \ge \sigma_{k+1} for i \le k+1. Thus \|A - B\| \ge \sigma_{k+1} = \|A - A_k\|. \square
This result underlies data compression and dimensionality reduction: keeping the top k singular values captures the most important structure in A. The truncated SVD A_k is the best rank-k approximation in both operator and Frobenius norms.
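A short numerical illustration of Theorem 13.3 (a sketch with an arbitrary example matrix): the rank-k truncation A_k built from the top k singular triples satisfies \|A - A_k\| = \sigma_{k+1}.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 5))                  # arbitrary full-rank example
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # truncated SVD, rank k

print(np.linalg.matrix_rank(A_k) == k)                 # True
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))    # error = sigma_{k+1}
```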
Remark (PCA). Given a centered data matrix X \in \mathbb{R}^{n \times N}, the covariance matrix C = \frac{1}{N}XX^T is symmetric positive semidefinite. The SVD X = U\Sigma V^T diagonalizes C = \frac{1}{N}U\Sigma\Sigma^T U^T, so the columns of U are the principal component directions, ordered by variance \sigma_i^2/N. The Eckart-Young theorem then shows that projecting onto the top k singular vectors gives the best rank-k approximation to X, minimizing reconstruction error; this is the basis of dimensionality reduction.
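The PCA computation in the remark can be sketched as follows (assumptions: rows of X are features, columns are the N samples, and the synthetic data is centered inside the snippet):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
# Synthetic data: 3 features, N samples, most variance along the first axis.
X = rng.normal(size=(3, N)) * np.array([[5.0], [1.0], [0.2]])
X = X - X.mean(axis=1, keepdims=True)        # center each feature (row)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Principal directions = columns of U; variances = sigma_i^2 / N.
variances = s**2 / N

k = 1
scores = U[:, :k].T @ X                      # coordinates in the top-k directions
X_k = U[:, :k] @ scores                      # best rank-k reconstruction of X

print(variances)                             # decreasing variances
print(np.linalg.norm(X - X_k))               # reconstruction error
```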
13.6 Least Squares Problems
The least squares problem seeks \hat{x} \in \mathbb{R}^n minimizing \|Ax - b\| for given A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m. When Ax = b has no exact solution (overdetermined system with m > n), we find the best approximation.
This problem pervades applied mathematics: fitting curves to data, parameter estimation, signal reconstruction. The solution connects orthogonal projections, the pseudo-inverse, and QR decomposition.
Theorem 13.4 (Least squares solution) The vector \hat{x} \in \mathbb{R}^n minimizes \|Ax - b\| if and only if A^T A \hat{x} = A^T b. If A has full column rank, the solution is unique: \hat{x} = (A^T A)^{-1} A^T b.
Proof. By the projection theorem (Theorem 11.8 in Chapter 11), the vector in \operatorname{Col}(A) closest to b is the orthogonal projection of b onto \operatorname{Col}(A). Hence \hat{x} minimizes \|Ax - b\| iff A\hat{x} equals this projection, i.e. iff b - A\hat{x} \perp \operatorname{Col}(A).
This orthogonality condition states: for all y \in \mathbb{R}^n, \langle Ay, b - A\hat{x} \rangle = 0. Equivalently, A^T(b - A\hat{x}) = 0, yielding the normal equations A^T A \hat{x} = A^T b.
If A has full column rank, \ker(A) = \{0\}, so \ker(A^T A) = \{0\} and A^T A is invertible. Thus \hat{x} = (A^T A)^{-1} A^T b is the unique solution. \square
The matrix (A^T A)^{-1} A^T is the left pseudo-inverse of A when A has full column rank. It generalizes matrix inversion to rectangular matrices.
13.6.1 Example: Linear Regression
Given data points (t_1, y_1), \ldots, (t_m, y_m), fit a line y = c_0 + c_1 t. Minimize \sum_{i=1}^m (y_i - c_0 - c_1 t_i)^2 = \|Ax - b\|^2, where A = \begin{pmatrix} 1 & t_1 \\ \vdots & \vdots \\ 1 & t_m \end{pmatrix}, \quad x = \begin{pmatrix} c_0 \\ c_1 \end{pmatrix}, \quad b = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}.
The normal equations A^T A x = A^T b yield \begin{align*} A^T A &= \begin{pmatrix} m & \sum t_i \\ \sum t_i & \sum t_i^2 \end{pmatrix}, \quad A^T b = \begin{pmatrix} \sum y_i \\ \sum t_i y_i \end{pmatrix}. \end{align*} Solving gives the least squares regression line.
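A sketch of the regression example in NumPy (the data points are made up for illustration); solving the normal equations directly and calling np.linalg.lstsq should give the same coefficients:

```python
import numpy as np

# Illustrative data points (t_i, y_i).
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

A = np.column_stack([np.ones_like(t), t])    # design matrix with rows [1, t_i]

# Normal equations: (A^T A) x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Library least squares solver for comparison.
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_normal, x_lstsq))        # True
print(x_normal)                              # [c0, c1]: intercept and slope
```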
13.7 QR Method for Least Squares
Computing (A^T A)^{-1} is numerically unstable when A^T A is ill-conditioned (near-singular). The QR decomposition provides a more stable algorithm.
Compute A = QR where Q \in \mathbb{R}^{m \times m} is orthogonal and R \in \mathbb{R}^{m \times n} is upper triangular. Partition R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix} with R_1 \in \mathbb{R}^{n \times n} upper triangular. Then \|Ax - b\|^2 = \|QRx - b\|^2 = \|Rx - Q^T b\|^2, using orthogonality \|Qv\| = \|v\|. Write Q^T b = \begin{pmatrix} c \\ d \end{pmatrix} with c \in \mathbb{R}^n, d \in \mathbb{R}^{m-n}. Then \|Rx - Q^T b\|^2 = \|R_1 x - c\|^2 + \|d\|^2. Minimizing over x requires R_1 x = c, solved by back-substitution since R_1 is upper triangular. The minimum residual is \|d\|.
Corollary 13.2 If A = QR with Q orthogonal and R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix} with R_1 upper triangular and invertible, then the least squares solution is \hat{x} = R_1^{-1} c, where c consists of the first n components of Q^T b.
This method avoids forming A^T A, improving numerical stability. The condition number of R_1 equals that of A, whereas the condition number of A^T A is the square of A’s condition number, amplifying errors.
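A minimal sketch of the QR approach (assuming A has full column rank; the overdetermined system is an arbitrary example, and the code uses the reduced QR factorization, which gives the same \hat{x}): factor A, form c = Q^T b, and back-substitute R_1 x = c.

```python
import numpy as np
from scipy.linalg import solve_triangular    # back-substitution for triangular R1

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 3))                  # arbitrary full-column-rank example
b = rng.normal(size=8)

Q, R1 = np.linalg.qr(A)                      # reduced QR: Q is 8 x 3, R1 is 3 x 3
c = Q.T @ b
x_qr = solve_triangular(R1, c)               # solve R1 x = c by back-substitution

# Agrees with the lstsq / normal-equations solution.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_lstsq))            # True
```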
13.8 Connection to Pseudo-Inverse and SVD
The pseudo-inverse (Section 13.4) provides another perspective on least squares. By Theorem 13.2, \hat{x} = A^\dagger b minimizes \|Ax - b\| with minimal norm among all minimizers.
Using the SVD A = U \Sigma V^T, we have A^\dagger = V \Sigma^\dagger U^T, so \hat{x} = V \Sigma^\dagger U^T b = \sum_{i=1}^r \frac{\langle b, u_i \rangle}{\sigma_i} v_i. This expresses the solution as a sum over singular vectors, weighted by reciprocal singular values. Small singular values amplify errors in b, indicating ill-conditioning.
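The singular-vector expansion of \hat{x} can be checked directly (a sketch reusing an arbitrary example system):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 4))                   # arbitrary example
b = rng.normal(size=6)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# x_hat = sum_i <b, u_i> / sigma_i * v_i over the nonzero singular values.
r = np.sum(s > 1e-12)
x_hat = sum((U[:, i] @ b) / s[i] * Vt[i, :] for i in range(r))

print(np.allclose(x_hat, np.linalg.pinv(A) @ b))   # matches A^+ b
```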
13.9 Closing Remarks
The SVD T = \sum \sigma_i u_i \otimes v_i is the extension of the spectral theorem to arbitrary linear maps: singular values are square roots of eigenvalues of T^*T, the two orthonormal bases replace the single eigenbasis, and the geometry is an ellipsoid with semi-axes \sigma_i u_i. The pseudo-inverse inverts by reciprocating singular values, giving the minimum-norm least squares solution \hat{x} = A^\dagger b. The Eckart-Young theorem says truncating to the top k singular values is the best rank-k approximation in operator norm.
Least squares—minimizing \|Ax - b\|—is solved algebraically via the normal equations A^TA\hat{x} = A^Tb, and numerically via QR decomposition to avoid squaring the condition number.