The Backpropagation Engine: Demystifying the Chain Rule
Tracking how local sensitivities multiply through stacked pipelines to fit any curve.
A small catalog of functions comes with derivatives we can write down immediately; these are the primitives we will compose into more elaborate models. Start with the straight line. For a linear function ℓ(x) = ax + b, the derivative is constant: ℓ′(x) = a. The reason is geometric. The graph and its tangent coincide at every point (both are the same line), so the local slope never changes. The number a is the slope of the line everywhere; adding b only shifts the graph vertically without affecting the slope.
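As a quick numerical illustration of the constant slope, the sketch below measures the secant slope of a line at several points and step sizes. The coefficients a = 3 and b = -2 and the helper name secant_slope are hypothetical choices made only for this example.

```python
def secant_slope(f, x, h):
    """Slope of the straight line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


# Hypothetical coefficients, chosen only for illustration.
a, b = 3.0, -2.0


def line(x):
    return a * x + b


# For a straight line, the secant slope equals a at every x and for every
# step size h, not just in the limit of small h.
slopes = [
    secant_slope(line, x, h)
    for x in (-5.0, 0.0, 7.0)
    for h in (2.0, 0.5, 0.125)
]

for s in slopes:
    print(s)
```

Changing b in this sketch leaves every measured slope unchanged, which matches the observation that b only translates the graph.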
Quadratics show how the slope can vary with x. For q(x) = x^2, the derivative is q′(x) = 2x. At x = 0 the graph is flat and the derivative is 0; away from the origin the parabola steepens in proportion to x, so doubling x doubles the slope, and for negative x the slope is negative. This dependence of the slope on position is captured exactly by 2x. The same pattern generalizes: the power rule states that, for n a positive integer,
(x^n)′ = n · x^(n−1).
The derivative reduces the power by one and pulls down the original exponent as a multiplier, matching the intuition that higher-degree monomials grow (and thus steepen) faster.
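The power rule can be checked numerically. The following is a minimal sketch, not a proof: it compares n · x^(n−1) against a central-difference estimate of the slope. The helper names central_difference and power_rule, the step size h, and the sample values of n and x are all assumptions made for illustration.

```python
def central_difference(f, x, h=1e-5):
    """Approximate f'(x) by the symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)


def power_rule(n, x):
    """Closed-form derivative of x**n for a positive integer n."""
    return n * x ** (n - 1)


max_error = 0.0
for n in (1, 2, 3, 5):
    for x in (-2.0, 0.5, 3.0):
        numeric = central_difference(lambda t: t ** n, x)
        exact = power_rule(n, x)
        max_error = max(max_error, abs(numeric - exact))
        print(f"n={n}, x={x}: numeric={numeric:.6f}, exact={exact:.6f}")

print("largest absolute discrepancy:", max_error)
```

The two columns should agree up to a small approximation error that comes from the finite step size and floating-point rounding. With n = 2 the exact column reproduces 2x, and with n = 1 it reproduces the constant slope of a line.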
Some functions have especially clean derivative formulas and appear constantly in modeling. The natural exponential has the remarkable property (e^x)′ = e^x: its rate of change at any point equals its current value. The natural logarithm has derivative (ln x)′ = 1/x for x > 0, indicating diminishing sensitivity as x grows; doubling x halves the local slope.
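Both properties are easy to observe numerically. The sketch below, under the assumption that a central-difference quotient with a small step is an adequate slope estimate, compares the estimated slope of e^x with e^x itself and the estimated slope of ln x with 1/x. The helper name central_difference and the sample points are hypothetical.

```python
import math


def central_difference(f, x, h=1e-5):
    """Approximate f'(x) by the symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)


exp_errors = []
log_errors = []
for x in (0.5, 1.0, 2.0, 4.0):
    d_exp = central_difference(math.exp, x)
    d_log = central_difference(math.log, x)
    # The slope of e^x should match its value; the slope of ln x should match 1/x.
    exp_errors.append(abs(d_exp - math.exp(x)))
    log_errors.append(abs(d_log - 1.0 / x))
    print(f"x={x}: slope of exp={d_exp:.6f}, slope of ln={d_log:.6f}")

# Doubling x should roughly halve the slope of ln x.
ratio = central_difference(math.log, 4.0) / central_difference(math.log, 2.0)
print("slope ratio for ln at x=4 versus x=2:", ratio)
```

The sample points are all positive because ln x is only defined for x > 0.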
These rules are not tricks; they are compact statements about local rate of change for specific, well-understood shapes. They are also exactly the kinds of closed-form derivatives we rely on as atomic pieces. When we use one of these primitives inside a larger expression, we carry along its known local behavior.
Common primitives differ in how their local slope behaves as the input changes.
To combine primitives, we need rules that describe how derivatives behave under algebraic operations. The sum rule says that differentiation distributes over addition:
(f + g)′(x) = f′(x) + g′(x).



