In 1969, Marvin Minsky and Seymour Papert proved that a single-layer perceptron cannot solve XOR, the exclusive-or problem. Funding dried up. Neural network research stalled. The AI Winter began.
The solution was known all along: add more layers. A single neuron draws one line. Multiple neurons draw multiple lines. Combined, they can carve out any region. It took until 1986 for Rumelhart, Hinton, and Williams to popularize backpropagation, the algorithm that makes training these multilayer networks practical. This essay explains both ideas.
The XOR Problem
XOR outputs 1 when inputs differ, 0 when they match:
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0
Plot these four points. The 1s sit on opposite corners. No single straight line can separate them from the 0s. The classes are not linearly separable, and that is the fundamental limitation of single-layer networks.
The perceptron and Adaline are both linear classifiers. They can only learn decision boundaries that are straight lines (or hyperplanes in higher dimensions). For many real problems, this is not enough.
The Solution: Hidden Layers
The key insight is that one neuron draws one line, but multiple neurons can draw multiple lines. If we add a layer of neurons between input and output, each hidden neuron can learn to draw its own line. The output neuron then combines these to carve out any region.
For XOR, we need just two hidden neurons:
- Hidden neuron 1 learns to separate (0,0) from the other points
- Hidden neuron 2 learns to separate (1,1) from the other points
- The output neuron combines these: output 1 if exactly one hidden neuron fires
This is the multilayer perceptron (MLP): input layer, one or more hidden layers, and output layer. Every neuron in one layer connects to every neuron in the next layer (fully connected or dense layers).
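To make that construction concrete, here is a hand-set version of the two-hidden-unit network with a step activation. The specific weights and thresholds are just one choice that works, not something the training procedure would necessarily find:
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
step = lambda z: (z > 0).astype(int)

W_h = np.array([[1.0, 1.0],      # hidden 1: fires for everything except (0,0) (an OR gate)
                [1.0, 1.0]])     # hidden 2: fires only for (1,1) (an AND gate)
b_h = np.array([-0.5, -1.5])
W_out = np.array([[1.0, -1.0]])  # output: hidden 1 fires and hidden 2 does not
b_out = np.array([-0.5])

a_h = step(X @ W_h.T + b_h)
a_out = step(a_h @ W_out.T + b_out)
print(a_out.ravel())             # [0 1 1 0]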
Why Non-linearity Matters
There is a catch. If we just stack linear transformations, we get another linear transformation. Three matrix multiplications in a row collapse into one. The hidden layer would be pointless.
The solution is activation functions. After computing the weighted sum, we pass it through a non-linear function. The classic choice is the sigmoid:
sigmoid(z) = 1 / (1 + exp(-z))
This squashes any input to a value between 0 and 1. More importantly, it is non-linear. Now stacking layers actually increases the network's expressive power.
The sigmoid has a convenient derivative: σ'(z) = σ(z) × (1 - σ(z)). This matters for backpropagation. Modern networks often use ReLU (max(0, z)) because it trains faster, but sigmoid is excellent for understanding the fundamentals.
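A minimal numpy version of both formulas (the function names here are just for illustration):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)               # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))

print(sigmoid(0.0))                  # 0.5, the midpoint of the squashing curve
print(sigmoid_derivative(0.0))       # 0.25, where the sigmoid is steepest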
Forward Propagation
Data flows through the network layer by layer. For a network with one hidden layer:
# Hidden layer
z_h = X @ W_h.T + b_h # net input
a_h = sigmoid(z_h) # activation
# Output layer
z_out = a_h @ W_out.T + b_out
a_out = sigmoid(z_out)
Each layer does two things: a linear transformation (matrix multiply plus bias) and a non-linear activation. The output of one layer becomes the input to the next.
For a network with 2 inputs, 4 hidden units, and 1 output:
- W_h has shape (4, 2): 4 hidden neurons, each with 2 input weights
- b_h has shape (4,): one bias per hidden neuron
- W_out has shape (1, 4): 1 output neuron, 4 weights from hidden layer
- b_out has shape (1,): one bias for output
- Total parameters: 4×2 + 4 + 1×4 + 1 = 17
This grows quickly. A network for MNIST digit classification (784 inputs, 50 hidden, 10 outputs) has 784×50 + 50 + 50×10 + 10 = 39,760 parameters.
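The count is plain arithmetic, and a throwaway helper (mlp_params is just an illustrative name) confirms both figures:
def mlp_params(n_in, n_hidden, n_out):
    # hidden weights + hidden biases + output weights + output biases
    return (n_hidden * n_in + n_hidden) + (n_out * n_hidden + n_out)

print(mlp_params(2, 4, 1))       # 17
print(mlp_params(784, 50, 10))   # 39760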
The Loss Function
As with Adaline, we need a way to measure how wrong we are. Here we use the Mean Squared Error (cross-entropy is the more common choice for classification, but MSE keeps the math consistent with Adaline):
L = (1/n) × Σ(y - a_out)²
The goal of training is to find weights that minimize this loss. We still use gradient descent, but now we need gradients for weights in multiple layers.
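In numpy the loss is a one-liner (mse_loss is just an illustrative name; it averages over samples, and over outputs if there are several):
import numpy as np

def mse_loss(y, a_out):
    return np.mean((y - a_out) ** 2)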
The Backpropagation Algorithm
Here is the problem: hidden layer weights do not directly affect the output. They affect the hidden activations, which affect the output. How do we compute their gradients?
The answer is the chain rule from calculus:
∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w
We decompose the gradient into a chain of simpler derivatives. The key insight is that we compute these right-to-left (backward through the network), reusing intermediate results.
Step 1: Output layer error. Compute how much the output contributed to the loss:
δ_out = (a_out - y) × sigmoid'(z_out)
This is the error signal: how wrong we are, scaled by how sensitive the output is to changes.
Step 2: Propagate error backward. Each hidden unit's error is proportional to how much it contributed to the output error:
δ_h = (δ_out @ W_out) × sigmoid'(z_h)
The output error flows backward through the weights, then gets scaled by each hidden unit's activation derivative.
Step 3: Compute gradients. Now we have error signals for each layer:
∂L/∂W_out = δ_out.T @ a_h
∂L/∂W_h = δ_h.T @ X
The gradient for each weight matrix is the error times the input to that layer.
Step 4: Update weights. Same as always:
W_out -= learning_rate × ∂L/∂W_out
W_h -= learning_rate × ∂L/∂W_h
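All four steps fit in a few lines of numpy. The sketch below runs them once on a tiny random network and checks one gradient against a finite-difference estimate. It keeps the factor of 2 from differentiating the square, which the formulas above (and the implementation below) fold into the learning rate; bias gradients are omitted for brevity.
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(5, 2)                          # 5 samples, 2 features
y = rng.randint(0, 2, (5, 1)).astype(float)
W_h, b_h = rng.randn(4, 2), np.zeros(4)     # 4 hidden units
W_out, b_out = rng.randn(1, 4), np.zeros(1) # 1 output unit

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward():
    a_h = sigmoid(X @ W_h.T + b_h)
    return a_h, sigmoid(a_h @ W_out.T + b_out)

def loss():
    return np.mean((y - forward()[1]) ** 2)

a_h, a_out = forward()
delta_out = (a_out - y) * a_out * (1 - a_out)      # step 1: output error signal
delta_h = (delta_out @ W_out) * a_h * (1 - a_h)    # step 2: error flows backward
grad_W_out = 2 * delta_out.T @ a_h / len(X)        # step 3: gradients, averaged
grad_W_h = 2 * delta_h.T @ X / len(X)              #         over the batch

# Finite-difference check on a single weight: both printed numbers should nearly agree
eps = 1e-6
W_h[0, 0] += eps; up = loss()
W_h[0, 0] -= 2 * eps; down = loss()
W_h[0, 0] += eps
print(grad_W_h[0, 0], (up - down) / (2 * eps))

eta = 0.5                                          # step 4: one gradient-descent update
W_out -= eta * grad_W_out
W_h -= eta * grad_W_h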
Why Backward?
Computing gradients forward (from inputs toward the output) means propagating full Jacobian matrices through every layer, a chain of expensive matrix-matrix multiplications. Because the loss is a single scalar, computing backward turns each step into a cheaper matrix-vector product, and the δ terms from one layer are reused for the layer before it.
This is reverse-mode automatic differentiation; backpropagation is exactly this idea applied to neural networks. Without it, training deep networks would be computationally infeasible.
A Complete Implementation
import numpy as np

class MLP:
    def __init__(self, n_hidden=50, epochs=100, eta=0.001, random_state=1):
        self.n_hidden = n_hidden
        self.epochs = epochs
        self.eta = eta
        self.random_state = random_state

    def _sigmoid(self, z):
        # Clip the net input to avoid overflow in exp
        return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

    def _forward(self, X):
        z_h = X @ self.W_h.T + self.b_h            # hidden net input
        a_h = self._sigmoid(z_h)                   # hidden activation
        z_out = a_h @ self.W_out.T + self.b_out    # output net input
        a_out = self._sigmoid(z_out)               # output activation
        return z_h, a_h, z_out, a_out

    def _backward(self, X, a_h, a_out, y):
        n = X.shape[0]
        # Output layer: error signal and gradients, averaged over the batch
        delta_out = (a_out - y) * a_out * (1 - a_out)
        grad_W_out = delta_out.T @ a_h / n
        grad_b_out = delta_out.mean(axis=0)
        # Hidden layer: propagate the error backward through W_out
        delta_h = (delta_out @ self.W_out) * a_h * (1 - a_h)
        grad_W_h = delta_h.T @ X / n
        grad_b_h = delta_h.mean(axis=0)
        return grad_W_h, grad_b_h, grad_W_out, grad_b_out

    def fit(self, X, y):
        # Ensure y is 2-D (n_samples, n_outputs) so it broadcasts correctly against a_out
        y = y.reshape(-1, 1) if y.ndim == 1 else y
        n_features = X.shape[1]
        n_output = y.shape[1]
        rng = np.random.RandomState(self.random_state)
        self.W_h = rng.normal(0, 0.1, (self.n_hidden, n_features))
        self.b_h = np.zeros(self.n_hidden)
        self.W_out = rng.normal(0, 0.1, (n_output, self.n_hidden))
        self.b_out = np.zeros(n_output)
        for _ in range(self.epochs):
            z_h, a_h, z_out, a_out = self._forward(X)
            grads = self._backward(X, a_h, a_out, y)
            self.W_h -= self.eta * grads[0]
            self.b_h -= self.eta * grads[1]
            self.W_out -= self.eta * grads[2]
            self.b_out -= self.eta * grads[3]
        return self

    def predict(self, X):
        _, _, _, a_out = self._forward(X)
        if a_out.shape[1] == 1:
            return (a_out >= 0.5).astype(int)    # binary: threshold at 0.5
        return a_out.argmax(axis=1)              # multi-class: pick the largest output
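A quick sanity check is to point the class at the XOR data from the start of the essay. The hyperparameters below are illustrative rather than tuned; full-batch training with sigmoid and MSE is slow on XOR, so it gets a large learning rate and many epochs:
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])

mlp = MLP(n_hidden=4, epochs=20000, eta=1.0)
mlp.fit(X_xor, y_xor)
print(mlp.predict(X_xor).ravel())    # typically [0 1 1 0]; if training gets stuck,
                                     # try a different random_state or more epochs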
Training on MNIST
The classic benchmark is MNIST: 60,000 images of handwritten digits, each 28×28 pixels. Flatten each image to 784 features, use 10 output neurons (one per digit), and train with mini-batch gradient descent.
With 50 hidden neurons and 50 epochs of training, this simple MLP achieves about 94.5% accuracy on the test set. Not state-of-the-art, but remarkable for such a simple network.
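Here is a sketch of that training loop, reusing the class's _forward and _backward for mini-batch updates. Loading MNIST via scikit-learn's fetch_openml and taking the first 60,000 rows as the training set are assumptions about the setup, and the hyperparameters are illustrative rather than the exact ones behind the 94.5% figure:
import numpy as np
from sklearn.datasets import fetch_openml

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0                              # scale pixel values to [0, 1]
y = y.astype(int)
Y = np.eye(10)[y]                          # one-hot encode the digit labels

X_train, Y_train = X[:60000], Y[:60000]
X_test, y_test = X[60000:], y[60000:]

mlp = MLP(n_hidden=50, eta=0.1)
rng = np.random.RandomState(1)
mlp.W_h = rng.normal(0, 0.1, (50, 784))    # initialize the weights by hand, since
mlp.b_h = np.zeros(50)                     # we drive the updates ourselves below
mlp.W_out = rng.normal(0, 0.1, (10, 50))
mlp.b_out = np.zeros(10)

batch_size = 100
for epoch in range(50):
    # Shuffle the training set, then update on one mini-batch at a time
    for batch in np.array_split(rng.permutation(len(X_train)), len(X_train) // batch_size):
        _, a_h, _, a_out = mlp._forward(X_train[batch])
        grads = mlp._backward(X_train[batch], a_h, a_out, Y_train[batch])
        mlp.W_h -= mlp.eta * grads[0]
        mlp.b_h -= mlp.eta * grads[1]
        mlp.W_out -= mlp.eta * grads[2]
        mlp.b_out -= mlp.eta * grads[3]

print('test accuracy:', (mlp.predict(X_test) == y_test).mean())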
Universal Approximation
In 1989, George Cybenko proved the universal approximation theorem: a network with a single hidden layer of sigmoid units can approximate any continuous function on a compact domain to arbitrary precision, given enough hidden neurons.
This is why neural networks are so powerful. They are not limited to any particular class of functions. With enough capacity and data, they can learn almost anything.
The catch is "enough." In practice, deeper networks (more layers, not just more neurons) often learn more efficiently than shallow wide networks. This is why we have deep learning.
What Changes in Modern Deep Learning
The fundamentals remain the same: forward pass, compute loss, backward pass, update weights. What changes:
- Depth: Modern networks have hundreds of layers, not one or two
- Activations: ReLU (max(0, z)) instead of sigmoid, to avoid vanishing gradients
- Optimizers: Adam and AdamW instead of vanilla SGD, for faster convergence
- Regularization: Dropout, batch normalization, weight decay to prevent overfitting
- Architectures: Convolutional layers for images, attention for sequences
But at the core, it is still: weighted sum, activation, repeat; then backprop to compute gradients, step downhill, repeat.
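For reference, the ReLU mentioned in the list above, and its derivative, are even simpler than the sigmoid's:
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)     # 1 where the unit is active, 0 where it is off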
Try It Yourself
I built an interactive demo where you can compare a single perceptron versus an MLP on the XOR problem. Watch the perceptron fail while the MLP learns to carve out the correct regions: i33ym.cc/demo-mlp
The Jupyter notebook with all code is on GitHub: github.com/i33ym/workshop
What's Next
We have covered the three foundational algorithms: the perceptron's discrete update, Adaline's gradient descent, and the MLP's backpropagation. These are the building blocks of all neural networks.
In the next workshop, we will look at practical concerns: how to debug neural networks, how to prevent overfitting, and how to tune hyperparameters. The theory is beautiful, but making it work in practice is an art.