The Algorithm Behind All Deep Learning
January 2026
Last time we learned:
z = w * x + b
y = 1 if z >= 0 else 0
Update rule: w = w + eta * (y - y_hat) * x
Converges on linearly separable data
Simple and elegant
But there is a problem...
No measure of "how wrong"
Perceptron only knows: right or wrong
Missing by 0.001? Wrong
Missing by 1000? Wrong
Same update either way
Imagine learning to throw darts blindfolded.
Someone only tells you "hit" or "miss" - never "a little to the left".
A way to measure how far we are from the target
Not just direction, but magnitude
A continuous measure of error
Adaptive Linear Neuron (Adaline)
Bernard Widrow and Ted Hoff, Stanford, 1960
Update weights based on the continuous output
Not the thresholded prediction
|  | Perceptron | Adaline |
|---|---|---|
| Computes | z = w * x + b | z = w * x + b |
| Updates on | step(z) | z directly |
| Error signal | Binary | Continuous |
Threshold only used for final prediction, not learning
Measuring "how wrong" mathematically
Mean Squared Error (MSE): square the errors, take the average
MSE = (1/n) * sum((y - output)^2)
Positive and negative errors do not cancel out
Larger errors are penalized more heavily
Differentiable everywhere
That last one is crucial...
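Because MSE is differentiable everywhere, its gradient has a simple closed form. A minimal NumPy sketch, assuming X, y, w, b are an already-defined feature matrix, target vector, weight vector, and bias (none are given on these slides):

import numpy as np

def mse_and_gradient(X, y, w, b):
    # Forward pass: continuous output and errors
    output = X.dot(w) + b
    errors = y - output
    loss = (errors ** 2).mean()
    # Closed-form partial derivatives of MSE w.r.t. w and b
    grad_w = -(2.0 / X.shape[0]) * X.T.dot(errors)
    grad_b = -2.0 * errors.mean()
    return loss, grad_w, grad_b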
Imagine a bowl-shaped surface
Height = loss (error)
Position = weight values
Goal: find the lowest point
MSE with linear model = convex surface
Only one minimum (global)
No local minima to get stuck in
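A one-weight illustration of the bowl, a sketch with toy data (everything here is made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # toy data where the true weight is 2
ws = np.linspace(-2, 6, 100)
losses = [((y - w * X) ** 2).mean() for w in ws]
plt.plot(ws, losses)            # convex curve with one global minimum at w = 2
plt.xlabel('w')
plt.ylabel('MSE')
plt.show()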
Gradient descent: the algorithm that powers all of deep learning
You are blindfolded on a hilly landscape.
To find the lowest point, feel which way is downhill and step that direction.
Repeat until you stop going down.
The gradient tells us which direction is "uphill"
It is the vector of partial derivatives
gradient = [dL/dw1, dL/dw2, ..., dL/db]
Move in the opposite direction of the gradient
Opposite because gradient points uphill
For MSE loss with linear activation:
Error times input, averaged over all samples
for epoch in range(n_epochs):                        # n_epochs: number of passes
    output = X.dot(w) + b                            # net input for all samples
    errors = y - output                              # continuous errors
    w += eta * (2.0 / X.shape[0]) * X.T.dot(errors)  # step opposite the gradient
    b += eta * 2.0 * errors.mean()                   # same step for the bias
The learning rate (eta): the most important hyperparameter
Step size in weight space
How much to move each update
Too high: overshoots the minimum
Bounces back and forth; may diverge to infinity
Loss explodes
Too low: tiny steps toward the minimum
Takes forever to converge; may stop before reaching the minimum
Training is too slow
Just right: steady descent toward the minimum
Converges in reasonable time
Usually found by trial and error
Start with 0.01 or 0.1
If loss explodes, reduce by 10x
If too slow, increase by 10x
Modern optimizers adapt this automatically (Adam, RMSprop)
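A quick sweep to find eta by trial and error, a sketch that assumes the AdalineGD class defined later in these slides and standardized data X_std, y:

for eta in (0.0001, 0.001, 0.01, 0.1):
    ada = AdalineGD(eta=eta, n_iter=20).fit(X_std, y)
    # Rising loss curve: eta too high; barely falling: eta too low
    print(eta, ada.losses_[0], ada.losses_[-1])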
Why preprocessing matters
Features on different scales cause problems
Age: 0-100
Income: 0-1,000,000
Gradient descent struggles with elongated loss surfaces
Standardization: centers data at 0, scales to unit variance
Each feature now has mean=0, std=1
Loss surface becomes more "spherical"
Gradient points more directly to minimum
Same learning rate works for all features
Converges much faster
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
Always standardize before gradient descent
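A quick sanity check of the transform, assuming X is a NumPy feature matrix:

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature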
Three flavors of gradient descent
Batch gradient descent: compute the gradient over all training samples
One update per epoch
Smooth convergence
Expensive for large datasets
Stochastic gradient descent (SGD): compute the gradient for one sample at a time
n updates per epoch
Noisy but fast
Can escape shallow local minima
Mini-batch gradient descent: best of both worlds
Compute the gradient over small batches (32, 64, 128), sketched below
This is what everyone uses in practice
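A minimal mini-batch loop, a sketch only (the batch size, shuffling scheme, and names like n_epochs are assumptions, not from these slides):

rng = np.random.RandomState(1)
batch_size = 32
for epoch in range(n_epochs):
    idx = rng.permutation(X.shape[0])              # reshuffle every epoch
    for start in range(0, X.shape[0], batch_size):
        batch = idx[start:start + batch_size]
        errors = y[batch] - (X[batch].dot(w) + b)  # errors on this batch only
        w += eta * (2.0 / len(batch)) * X[batch].T.dot(errors)
        b += eta * 2.0 * errors.mean()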
SGD enables learning from streaming data
Update model as new samples arrive
No need to store entire dataset
Adapts to changing distributions
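One way to sketch a streaming update, as a hypothetical partial_fit helper (the name and signature are assumptions):

def partial_fit(w, b, x_new, y_new, eta=0.01):
    # One SGD step on a single incoming sample; the dataset is never stored
    error = y_new - (np.dot(x_new, w) + b)
    w = w + eta * 2.0 * error * x_new
    b = b + eta * 2.0 * error
    return w, b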
AdalineGD implementation
import numpy as np

class AdalineGD:
    """Adaline trained with full-batch gradient descent."""
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta                    # learning rate
        self.n_iter = n_iter              # number of epochs
        self.random_state = random_state  # seed for weight initialization

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
        self.b_ = np.float64(0.)
        self.losses_ = []
        for i in range(self.n_iter):
            net_input = self.net_input(X)
            output = self.activation(net_input)
            errors = (y - output)
            # Full-batch MSE gradient step: move opposite the gradient
            self.w_ += self.eta * 2.0 * X.T.dot(errors) / X.shape[0]
            self.b_ += self.eta * 2.0 * errors.mean()
            loss = (errors ** 2).mean()
            self.losses_.append(loss)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_) + self.b_

    def activation(self, X):
        return X  # identity: Adaline learns on the continuous output

    def predict(self, X):
        # Threshold used only for final class labels, never for learning
        return np.where(self.activation(self.net_input(X)) >= 0.5, 1, 0)
# Standardize features (X, y: an assumed, already-loaded two-class dataset)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Train Adaline
ada = AdalineGD(eta=0.5, n_iter=20)
ada.fit(X_std, y)

# Check convergence: MSE should fall smoothly across epochs
import matplotlib.pyplot as plt
plt.plot(range(1, len(ada.losses_) + 1), ada.losses_)
plt.xlabel('Epochs')
plt.ylabel('MSE')
plt.show()
Compare Perceptron and Adaline side by side
Adaline is not just historical
Gradient descent is how we train all neural networks
GPT? Gradient descent.
Image recognition? Gradient descent.
Self-driving cars? Gradient descent.
| Adaline | Deep Networks |
|---|---|
| One layer | Many layers |
| Linear activation | Non-linear (ReLU, etc) |
| Direct gradient | Backpropagation |
| MSE loss | Cross-entropy, etc |
But the core algorithm is the same
1. Define a loss function that measures error
2. Compute the gradient of loss w.r.t. weights
3. Update weights in the opposite direction
4. Repeat until convergence
| Concept | What It Does |
|---|---|
| MSE Loss | Measures how wrong we are (continuously) |
| Gradient | Points uphill in weight space |
| Gradient Descent | Steps downhill to minimize loss |
| Learning Rate | Controls step size |
| Standardization | Makes optimization easier |
| SGD | Scales to large datasets |
Multi-Layer Perceptrons
Breaking the linear barrier with hidden layers