Adaline & Gradient Descent

In 1960, just three years after the perceptron, Bernard Widrow and Ted Hoff at Stanford introduced a subtle but profound improvement. They called it Adaline, short for Adaptive Linear Neuron, and it brought to neural networks the algorithm that powers all of deep learning today: gradient descent.

The perceptron worked, but it had a problem. It could only tell you whether you were right or wrong, never how wrong. Adaline fixed this, and in doing so, laid the foundation for everything from image recognition to ChatGPT.

The Problem with the Perceptron

Recall how the perceptron learns: if it makes a mistake, it nudges the weights. If it gets the answer right, it does nothing. This binary feedback is like learning to throw darts while blindfolded, and someone only tells you "hit" or "miss" - never "a little to the left" or "way too high."

More precisely, the perceptron updates based on the thresholded output:

prediction = 1 if (w @ x + b) >= 0 else 0
error = target - prediction  # Can only be -1, 0, or +1
w += learning_rate * error * x
b += learning_rate * error

Whether you miss by 0.001 or by 1000, the update is the same. This is inefficient at best, and can cause oscillation at worst.

The Key Insight

Adaline's innovation was simple: update the weights based on the continuous net input, not the thresholded prediction. Instead of asking "did I get it right?", Adaline asks "how far off was I?"

The perceptron updates based on the thresholded output (0 or 1), giving it a binary error signal. Its feedback is simply "wrong" or "right." Adaline updates based on the continuous output (any number), giving it a continuous error signal. Its feedback is "wrong by this much."

The threshold is still used for final predictions, but learning happens on the raw output.
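
To make the contrast concrete, here is a minimal sketch with one sample (the weights and inputs are made-up values for illustration, not from a real dataset):

import numpy as np

w, b = np.array([0.5, -0.3]), 0.1
x, y = np.array([2.0, 1.0]), 1

z = w @ x + b                                 # continuous net input: 0.8
perceptron_error = y - (1 if z >= 0 else 0)   # thresholded first: 0 ("right")
adaline_error = y - z                         # continuous: 0.2 ("off by 0.2")

The perceptron sees nothing to fix; Adaline still sees a 0.2 gap to close.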

The Loss Function

To measure "how far off" we are, Adaline uses Mean Squared Error (MSE):

L(w, b) = (1/n) * sum_i (y_i - z_i)^2

where z_i is the continuous net input (w dot x_i + b), not the thresholded prediction, and the sum runs over all n training samples.

Why squared error? Three reasons:

  1. Positive and negative errors do not cancel out
  2. Larger errors are penalized more heavily (quadratically)
  3. It is differentiable everywhere - crucial for what comes next
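
In code, the loss is one line (a quick sketch; the sample values are made up):

import numpy as np

def mse_loss(y, z):
    # Mean squared error between targets y and net inputs z
    return ((y - z) ** 2).mean()

y = np.array([1, 0, 1])
z = np.array([0.9, 0.2, 0.4])
print(mse_loss(y, z))  # (0.01 + 0.04 + 0.36) / 3 ≈ 0.137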

Gradient Descent

Here is where calculus enters machine learning. The loss function defines a surface over the space of all possible weights. We want to find the lowest point on this surface - the weights that minimize error.

Imagine you are blindfolded on a hilly landscape and need to find the lowest point. What would you do? Feel which way is downhill and take a step in that direction. Repeat until you stop going down. That is gradient descent.

The gradient tells us which direction is "uphill" - the direction of steepest increase. So we move in the opposite direction:

w = w - learning_rate * gradient
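
Before specializing to MSE, here is the rule in its simplest form - a sketch minimizing the one-dimensional loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3):

w = 0.0
learning_rate = 0.1

for step in range(25):
    gradient = 2 * (w - 3)            # direction of steepest increase
    w = w - learning_rate * gradient  # step downhill

print(w)  # approaches 3.0, the minimum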

For MSE loss, the gradient with respect to w follows from the chain rule: each term (y_i - z_i)^2 contributes 2 * (y_i - z_i) * (-x_i), because z_i = w dot x_i + b. Averaged over samples:

gradient = -(2/n) * sum_i (y_i - z_i) * x_i

Since we move opposite the gradient, the minus signs cancel, giving the update rule (with errors = y - z):

w += learning_rate * (2/n) * X.T @ errors
b += learning_rate * (2/n) * errors.sum()

The Implementation

import numpy as np

class AdalineGD:
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta                    # learning rate
        self.n_iter = n_iter              # passes over the training set
        self.random_state = random_state  # seed for weight initialization

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
        self.b_ = np.float64(0.0)
        self.losses_ = []

        for _ in range(self.n_iter):
            net_input = self.net_input(X)  # continuous output, all samples
            errors = y - net_input         # "wrong by this much", per sample

            # Step opposite the MSE gradient
            self.w_ += self.eta * 2.0 * X.T.dot(errors) / X.shape[0]
            self.b_ += self.eta * 2.0 * errors.mean()

            loss = (errors ** 2).mean()    # record MSE for this epoch
            self.losses_.append(loss)

        return self

    def net_input(self, X):
        return np.dot(X, self.w_) + self.b_

    def predict(self, X):
        # Threshold at 0.5, the midpoint of the 0/1 targets
        return np.where(self.net_input(X) >= 0.5, 1, 0)

Notice how clean this is. The entire training loop is just:

  1. Compute net input for all samples
  2. Calculate errors
  3. Update weights in the direction that reduces error
  4. Record the loss
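
A quick usage sketch, on a tiny synthetic dataset (the two-cluster data here is made up for illustration; any linearly separable set works):

import numpy as np

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),   # class 0 cluster
               rng.normal(2.0, 0.5, (50, 2))])  # class 1 cluster
y = np.array([0] * 50 + [1] * 50)

model = AdalineGD(eta=0.01, n_iter=50).fit(X, y)
print(model.losses_[0], model.losses_[-1])  # loss should fall across epochs
print((model.predict(X) == y).mean())       # training accuracy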

The Learning Rate Problem

There is a catch. The learning rate (eta) must be chosen carefully:

Too large: You overshoot the minimum, bounce back and forth, and the loss explodes to infinity. Your model learns nothing.

Too small: You take tiny steps and convergence takes forever. Training might stop before reaching the minimum.

Just right: Steady descent toward the minimum. Usually found by trial and error, starting with 0.01 or 0.1.
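
To see the trade-off, a quick sweep (reusing X, y and AdalineGD from the sketches above; the eta values are arbitrary choices):

for eta in (0.5, 0.0001, 0.01):
    model = AdalineGD(eta=eta, n_iter=50).fit(X, y)
    print(eta, model.losses_[0], model.losses_[-1])
# Too large: the loss grows or oscillates. Too small: it barely moves.
# In between: it falls steadily.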

Feature Scaling

Gradient descent works much better when features are on similar scales. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, the loss surface becomes elongated and gradient descent struggles.

The solution is standardization:

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

This transforms each feature to have mean=0 and standard deviation=1. The loss surface becomes more "spherical" and gradient descent converges much faster.
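
In practice that is two lines before training (again reusing X, y and AdalineGD from the sketches above; the eta and n_iter values are assumptions):

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
model = AdalineGD(eta=0.1, n_iter=20).fit(X_std, y)
print(model.losses_[-1])  # typically converges in far fewer epochs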

Batch vs Stochastic

The implementation above is batch gradient descent - it computes the gradient over all training samples before updating.

Stochastic Gradient Descent (SGD) updates after each sample:

  • Faster iterations
  • Noisier updates (can help escape local minima)
  • Enables online learning from streaming data

In practice, nearly all modern training uses mini-batch SGD: update based on small batches of 32, 64, or 128 samples. Best of both worlds.
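
A minimal mini-batch variant of Adaline's update (my sketch, not code from the workshop notebook; the batch size and per-epoch shuffling are the usual conventions):

import numpy as np

def fit_minibatch(X, y, eta=0.01, n_iter=50, batch_size=32, seed=1):
    rgen = np.random.RandomState(seed)
    w = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        idx = rgen.permutation(len(y))  # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            errors = y[batch] - (X[batch] @ w + b)
            w += eta * 2.0 * X[batch].T @ errors / len(batch)
            b += eta * 2.0 * errors.mean()
    return w, b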

Why This Matters

Gradient descent is how we train every neural network. GPT-4, DALL-E, self-driving cars, protein folding - all trained with gradient descent.

What changes in deep learning:

  • More layers - Adaline has one layer, deep networks have many
  • Non-linear activations - ReLU, sigmoid, tanh instead of linear
  • Backpropagation - Chain rule to compute gradients through layers
  • Different loss functions - Cross-entropy for classification, etc.

But the core algorithm is identical: compute how wrong you are, figure out which direction reduces the error, take a step, repeat.

Try It Yourself

I built an interactive demo comparing Perceptron and Adaline side by side. Watch how Adaline follows the gradient smoothly while the Perceptron jumps around: i33ym.cc/demo-adaline

The Jupyter notebook with all code is on GitHub: github.com/i33ym/workshop

What's Next

We have covered the two foundational algorithms: the perceptron's discrete update rule and Adaline's gradient descent. But both are limited to linear decision boundaries.

In the next essay, we will break through this barrier with multi-layer perceptrons. By stacking layers and adding non-linear activations, we can approximate any function. This is where neural networks become truly powerful - and where backpropagation becomes essential.
