9  The Derivative

9.1 From Linear Maps to Differentiation

We’ve spent considerable time developing the theory of vectors, linear maps, and functionals. Before defining the derivative, let’s understand precisely why this machinery was necessary.

9.1.1 Local Linearity

Consider a smooth curve y = f(x) and fix a point (a, f(a)) on it. If we zoom in at this point, something remarkable happens: the curve appears to straighten out. Locally, near a, the graph looks like a straight line.

Tangent

This observation suggests a question: can we quantify this local “straightness”? More precisely, can we find a linear function that approximates f near a in a sense we can make rigorous?

9.1.2 The Space of Approximations

Suppose we want to approximate f near a point a using a simpler class of functions. The simplest choice is a constant function: f(a+h) \approx f(a) for small h. While easy to compute, this ignores any change and cannot reflect how f is moving at a.

On the other hand, we could use a quadratic approximation f(a+h) \approx f(a) + b h + c h^2. This captures both the rate of change and curvature, but requires two parameters—more than necessary if we are only interested in the instantaneous rate of change. (More sophisticated approximations, such as those from Taylor series, can capture higher-order behavior, but we will not study this until integral calculus.)

Linear approximations are the right choice. A function of the form f(a+h) \approx f(a) + m h captures the rate and direction of change with a single parameter m. The predicted change is proportional to the displacement: doubling h doubles the change, and reversing h reverses it. This simple structure is exactly what is needed to quantify how fast f is changing at a.

9.1.3 Linear Maps and Measurements

Recall from our earlier development that a map T : \mathbb{R} \to \mathbb{R} is linear if it preserves addition and scalar multiplication. In one dimension, every such map has the form T(h) = mh for some constant m. The entire transformation is encoded in this single scalar.

Now recall linear functionals: linear maps that produce real numbers. These act as measurement devices: given an input, they return a scalar that quantifies some property of that input. For our purposes, the differential df_a will be such a functional. Given a small displacement h from the point a, it returns the approximate change in f: df_a(h) = \text{approximate change in } f \text{ when moving distance } h \text{ from } a.

Since df_a is linear, it must have the form df_a(h) = mh for some m depending on a. This scalar m is what we will call the derivative f'(a).

Note: Notation Convention

Throughout, we distinguish:

  • df_a denotes the linear map \mathbb{R} \to \mathbb{R} (the geometric object)

  • f'(a) denotes its coordinate representation (a single number in 1D)

  • Relationship: df_a(h) = f'(a) \cdot h

In higher dimensions, f'(a) becomes the Jacobian matrix—a matrix of partial derivatives. The framework remains identical.

9.1.4 Measuring Approximation Quality

How do we determine whether a given linear map T(h) = mh is a good approximation to the change in f near a? We measure the error E(h) = f(a+h) - f(a) - T(h).

But absolute error alone is insufficient. An error of 0.001 might be negligible if the displacement h is 0.01, but enormous if h is 0.000001. What matters is whether the error is small relative to the displacement itself.

This is where the norm enters. In one dimension, the norm is simply absolute value |h|, measuring the size of the displacement. We consider the ratio \frac{|E(h)|}{|h|}.

A linear approximation is “best” if this ratio approaches zero as h \to 0. The error then vanishes faster than the displacement—it becomes negligible in the limit. This is the precise sense in which T captures the first-order behavior of f near a.

9.1.5 A Concrete Example

Let f(x) = x^2 and consider the point a = 1. For a small displacement h, direct computation gives f(1+h) = (1+h)^2 = 1 + 2h + h^2.

We identify the linear term 2h and the error E(h) = h^2. Is this error negligible relative to h? We compute \frac{|E(h)|}{|h|} = \frac{|h^2|}{|h|} = |h| \to 0 \quad \text{as } h \to 0.

Indeed, the linear map T(h) = 2h approximates the change in f near a = 1 with vanishing relative error. The coefficient 2 is the derivative f'(1) = 2.
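This computation is easy to check numerically. The following Python sketch (the function, base point, and step sizes are those of the example above) shows that the relative error |E(h)|/|h| shrinks with |h|:

```python
# Numerical check: for f(x) = x^2 at a = 1 with T(h) = 2h,
# the error is E(h) = h^2, so |E(h)| / |h| = |h| vanishes as h -> 0.

def relative_error(h):
    f = lambda x: x ** 2
    a = 1.0
    E = f(a + h) - f(a) - 2 * h   # E(h) = f(a+h) - f(a) - T(h)
    return abs(E) / abs(h)

for h in [0.1, 0.01, 0.001]:
    print(h, relative_error(h))   # ratio matches |h|
```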

Secant to Tangent

The general pattern: expand f(a+h), identify the linear term, verify that the remainder is o(h), and recognize the coefficient of the linear term as the derivative.

9.2 The Formal Definition

Definition 9.1 (Differentiability) Let f : I \to \mathbb{R} where I is an interval, and let a be an interior point of I. We say f is differentiable at a if there exists a linear map T : \mathbb{R} \to \mathbb{R} such that \lim_{h \to 0} \frac{|f(a+h) - f(a) - T(h)|}{|h|} = 0.

Since every linear map in one dimension has the form T(h) = m h, we write T(h) = f'(a) h, and the scalar f'(a) is called the derivative of f at a.

9.2.1 Alternative formulation

Dividing the approximation error by h and rearranging, \frac{f(a+h) - f(a)}{h} = f'(a) + \frac{E(h)}{h}.

As h \to 0, the right side approaches f'(a). Thus differentiability is equivalent to the existence of the limit f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}.

This is the classical difference quotient definition. Both formulations are equivalent, but the linear map perspective emphasizes that the derivative is fundamentally about approximation by a linear transformation.
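The difference quotient lends itself to direct computation. As a sketch (the function f(x) = x^2 and base point a = 3 are chosen for illustration), the quotient approaches f'(3) = 6 as h shrinks:

```python
# Difference quotient (f(a+h) - f(a)) / h for f(x) = x^2 at a = 3;
# as h shrinks, the quotient approaches f'(3) = 6.

def difference_quotient(f, a, h):
    return (f(a + h) - f(a)) / h

f = lambda x: x ** 2
for h in [1e-1, 1e-3, 1e-5]:
    print(h, difference_quotient(f, 3.0, h))
```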

Notation: We write f'(a), \frac{df}{dx}\big|_{x=a}, or Df(a) for the derivative at a. The notation Df(a) emphasizes the derivative as a linear map, a viewpoint essential in higher dimensions.

If a function is well-approximated by a linear map near a point, what does this tell us about the function’s behavior at that point? Linear maps are continuous—they can’t have jumps or breaks. If f is close to a linear map near a, it seems reasonable that f itself should be continuous at a. This intuition is correct and leads to one of the most important basic facts about derivatives.

Theorem 9.1 (Differentiability Implies Continuity) Let f : I \to \mathbb{R} and a \in I. If f is differentiable at a, then f is continuous at a.

Differentiability means there exists a linear map T(h) = f'(a) h such that f(a+h) = f(a) + T(h) + o(h), \quad h \to 0.

By definition of o(h), for any \varepsilon > 0 there exists \delta > 0 such that |h| < \delta implies \frac{|f(a+h) - f(a) - f'(a) h|}{|h|} < \varepsilon.

Hence |f(a+h) - f(a)| = |f'(a) h + o(h)| \le |f'(a)|\,|h| + |o(h)| < (|f'(a)| + \varepsilon) |h|.

As h \to 0, the right-hand side tends to 0. Therefore, \lim_{h \to 0} f(a+h) = f(a), so f is continuous at a. \square

9.3 Linear Approximation and the Tangent Line

If f is differentiable at a, the linear map df_a(h) = f'(a) h determines the tangent line to the graph of f at (a, f(a)).

Tangent Space

Recall that a linear map in 1D is completely determined by where it sends the basis vector 1: df_a(1) = f'(a) \cdot 1 = f'(a).

This single value encodes the entire linear transformation. Geometrically, it’s the slope of the tangent line. Algebraically, it’s the functional that measures change.

Writing x = a + h, the linear approximation becomes f(x) \approx f(a) + f'(a)(x - a), which is the equation of the tangent line.

Near x = a, the function f(x) is well approximated by this linear function. The derivative f'(a) is the slope of the tangent line, but more fundamentally, it is the coefficient of the linear map that best approximates the change in f near a.
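As a concrete sketch (using the earlier example f(x) = x^2 at a = 1, so the tangent line is L(x) = 1 + 2(x - 1)), we can compare f against its tangent line directly:

```python
# Tangent line to f(x) = x^2 at a = 1: L(x) = f(1) + f'(1)(x - 1) = 1 + 2(x - 1).
# The gap f(x) - L(x) equals (x - 1)^2, shrinking quadratically.

def f(x):
    return x ** 2

def tangent(x):
    a, fa, slope = 1.0, 1.0, 2.0   # a, f(a), f'(a)
    return fa + slope * (x - a)

for x in [1.1, 1.01]:
    print(x, f(x), tangent(x), f(x) - tangent(x))
```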

9.4 The Differential and Linear Functionals

The differential df_a is a linear functional in the sense of Section 8.5. It takes a displacement h \in \mathbb{R} and returns a scalar df_a(h) = f'(a)h \in \mathbb{R}.

In coordinates, if we think of h as the column vector \begin{pmatrix} h \end{pmatrix} and f'(a) as the row vector \begin{pmatrix} f'(a) \end{pmatrix}, then df_a(h) = \begin{pmatrix} f'(a) \end{pmatrix} \begin{pmatrix} h \end{pmatrix} = f'(a) \cdot h.

In one dimension, the distinction between f'(a) (a scalar) and df_a (a linear functional) is subtle: they contain the same information. The functional df_a is simply multiplication by the scalar f'(a).

Derivative as Linear Map

9.4.1 Examples

Example 1: Let f(x) = x^2 and compute f'(a).

We seek a linear map T(h) = mh such that \lim_{h \to 0} \frac{|(a+h)^2 - a^2 - mh|}{|h|} = 0.

Expanding (a+h)^2 = a^2 + 2ah + h^2, we have \frac{|a^2 + 2ah + h^2 - a^2 - mh|}{|h|} = \frac{|(2a - m)h + h^2|}{|h|} \leq |2a - m| + |h|.

This approaches zero as h \to 0 if and only if m = 2a. Thus f'(a) = 2a, and the linear approximation is f(a+h) = a^2 + 2ah + h^2, where h^2 = o(h) is the error.

Example 2: Show that f(x) = |x| is not differentiable at x = 0.

Suppose f were differentiable at 0 with derivative m. Then \lim_{h \to 0} \frac{||h| - 0 - mh|}{|h|} = 0.

For h > 0, this gives \frac{|h - mh|}{h} = |1 - m| \to 0, requiring m = 1.

For h < 0, this gives \frac{|-h - mh|}{|h|} = \frac{|h|\,|1 + m|}{|h|} = |1 + m| \to 0, requiring m = -1.

No single value of m satisfies both conditions. Therefore f is not differentiable at 0. Geometrically, there is no linear map that approximates the change in |x| from both sides at x = 0.
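The two-sided obstruction is easy to see numerically. In this Python sketch, the one-sided difference quotients of |x| at 0 sit at +1 and -1, so no single slope works:

```python
# One-sided difference quotients of f(x) = |x| at a = 0:
# (|h| - |0|) / h is +1 for every h > 0 and -1 for every h < 0.

def quotient(h):
    return (abs(h) - abs(0.0)) / h

right = [quotient(h) for h in (0.1, 0.01, 0.001)]   # all +1.0
left = [quotient(-h) for h in (0.1, 0.01, 0.001)]   # all -1.0
print(right, left)
```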

Example 3: Compute the derivative of f(x) = \frac{1}{x} for x \neq 0.

For a \neq 0, \frac{f(a+h) - f(a)}{h} = \frac{\frac{1}{a+h} - \frac{1}{a}}{h} = \frac{a - (a+h)}{h \cdot a(a+h)} = \frac{-1}{a(a+h)}.

As h \to 0, this approaches -\frac{1}{a^2}. Thus f'(a) = -\frac{1}{a^2}.
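A numerical sketch confirms the limit (the base point a = 2 is chosen for illustration, giving f'(2) = -1/4):

```python
# Difference quotient of f(x) = 1/x at a = 2; algebraically it equals
# -1 / (a * (a + h)), which tends to -1/a^2 = -0.25 as h -> 0.

def difference_quotient(a, h):
    return (1.0 / (a + h) - 1.0 / a) / h

for h in [1e-2, 1e-4, 1e-6]:
    print(h, difference_quotient(2.0, h))
```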


9.5 The Mean Value Theorem

In one dimension, a linear functional \varphi(h) = mh is determined by the single scalar m.

The differential df_c is such a functional, with m = f'(c). As the base point c varies, we obtain a family of linear functionals, each capturing the local behavior of f near c: df_c(h) = f'(c) h.

Each differential provides a local linear approximation f(c+h) = f(c) + df_c(h) + o(h).

This is accurate for small h, but what about finite displacements? Consider the interval [a,b] with displacement h = b - a. The differential at a predicts f(b) = f(a + (b-a)) \approx f(a) + df_a(b - a) = f(a) + f'(a)(b - a).

For linear functions, this prediction is exact everywhere. For nonlinear functions, the prediction depends on which base point we choose. The differential at a gives one prediction, the differential at b gives another, and differentials at intermediate points give yet others.

One might ask: Among this family of linear functionals \{df_c : c \in [a,b]\}, does there exist one whose prediction is exact?

The Idea

The figure suggests that there exists a point where a differential exactly captures the total change of f over [a,b]. To motivate the construction of an auxiliary function, consider the straight line connecting the endpoints (a,f(a)) and (b,f(b)). Let us define a line L satisfying L(a) = f(a), \qquad L(b) = f(b), so that it passes through the endpoints. By elementary algebra, its slope must be \frac{f(b) - f(a)}{b-a}, and hence L(x) = f(a) + \frac{f(b)-f(a)}{b-a}(x-a).

This line L encodes the “ideal” linear change across the interval. To locate a point where the derivative of f coincides with this ideal slope, it is natural to consider the difference between f and L. This motivates the definition of the auxiliary function \psi(x) = f(x) - L(x), which satisfies \psi(a) = \psi(b) = 0 by construction. The properties of \psi will then guide the identification of a point c \in (a,b) where f'(c) equals the slope of the secant line.

Theorem 9.2 (Mean Value Theorem) Let f : [a,b] \to \mathbb{R} be continuous on [a,b] and differentiable on (a,b). Then there exists c \in (a,b) such that f(b) - f(a) = f'(c)(b - a).

Equivalently, the differential at c captures the total change: f(b) - f(a) = df_c(b - a).

To locate a point where the differential exactly captures the total change, define the auxiliary function as before \psi(x) = f(x) - \Bigl(f(a) + \frac{f(b)-f(a)}{b-a}(x-a)\Bigr) = f(x) - L(x), so that \psi(a) = \psi(b) = 0. This \psi measures the deviation of f from the straight line L connecting the endpoints.

Since \psi is continuous on [a,b] and differentiable on (a,b), Theorem 7.5 guarantees it attains an extremum at some point c \in [a,b]. If \psi \equiv 0, any interior point c will do; otherwise an extremum occurs at some interior point c \in (a,b), where the differential vanishes: d\psi_c(h) = 0 \quad \text{for all } h.

By linearity, d\psi_c(h) = df_c(h) - dL_c(h) = df_c(h) - \frac{f(b)-f(a)}{b-a} h. Choosing h = b-a gives df_c(b-a) = f(b)-f(a), showing that the differential at c exactly reproduces the total change of f over [a,b]. \square
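To make the theorem concrete, here is a Python sketch that locates the point c where f'(c) matches the secant slope. The example function x^3 on [0,1] and the bisection search are illustrative choices; bisection applies here because f'(x) = 3x^2 is increasing on the interval:

```python
# For f(x) = x^3 on [0, 1] the secant slope is (f(1) - f(0)) / (1 - 0) = 1,
# so the MVT point satisfies 3c^2 = 1, i.e. c = 1/sqrt(3).

def mvt_point(f, fprime, a, b, tol=1e-12):
    slope = (f(b) - f(a)) / (b - a)
    lo, hi = a, b                  # assumes fprime is increasing on [a, b]
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if fprime(mid) < slope:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

c = mvt_point(lambda x: x ** 3, lambda x: 3 * x ** 2, 0.0, 1.0)
print(c)   # about 0.57735, i.e. 1/sqrt(3)
```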

The following corollary is an immediate consequence of the theorem.

Corollary 9.1 (Rolle’s Theorem) Let f : [a,b] \to \mathbb{R} be continuous on [a,b] and differentiable on (a,b). If f(a) = f(b), then there exists c \in (a,b) such that df_c = 0 \quad \text{or equivalently} \quad f'(c) = 0.

Consider the total change f(b)-f(a) = 0. By Theorem 9.2, there exists c \in (a,b) whose differential exactly captures this change: df_c(b-a) = 0. Since b-a \neq 0, it follows that df_c = 0. \square

9.6 Looking Back: Linear Algebra Revisited

Now that we’ve developed differentiation, let’s revisit the linear algebra concepts from Section 8.3 and see how they manifested:

Linear Algebra Concept | Role in Differentiation
Linear map T : \mathbb{R} \to \mathbb{R} | The derivative as an approximation: T(h) = f'(a) h
Linear functional \varphi(h) | The differential df_a(h) measures change
Norm \|h\| | Measuring displacement to define relative error
Composition T \circ S | Chain rule (coming in the next chapter)
Linear combination | Linearity of differentiation: (af + bg)' = af' + bg'

The key insight: differentiation is a process that extracts a linear map from a nonlinear function. Given f, the derivative operator D produces Df(a) = f'(a), which determines the linear functional df_a(h) = f'(a)h.
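The linearity row of the table can be illustrated numerically. In this sketch (the functions x^2, x^3, coefficients 2 and 3, and base point a = 1 are illustrative), differentiating a linear combination agrees with combining the derivatives:

```python
# Linearity of differentiation: (2f + 3g)'(a) = 2 f'(a) + 3 g'(a),
# checked with forward difference quotients for f(x) = x^2, g(x) = x^3
# at a = 1, where the exact value is 2*2 + 3*3 = 13.

def d(func, a, h=1e-6):
    return (func(a + h) - func(a)) / h

f = lambda x: x ** 2
g = lambda x: x ** 3
combo = lambda x: 2 * f(x) + 3 * g(x)

lhs = d(combo, 1.0)
rhs = 2 * d(f, 1.0) + 3 * d(g, 1.0)
print(lhs, rhs)   # both near 13
```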