“The power of neural nets isn’t just in their depth—it’s in their ability to approximate reality.”

In the world of machine learning, few ideas are as quietly revolutionary as the Universal Approximation Theorem. It’s a sort of mathematical guarantee that forms the foundation of why neural networks work at all. This overview is based on the work of G. Cybenko, Y. Bengio, I. Goodfellow, and others.

We’ll explore the intuition and practical relevance of the Universal Approximation Theorem (UAT), demystify what it promises (and doesn’t), and implement a live example of a neural network that approximates a nonlinear function.

What Is the Universal Approximation Theorem?


At its core, the UAT says:

A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of ℝⁿ, given appropriate weights and a non-linear activation function.

Frankly speaking, I had to re-read this a few times to understand what’s actually going on here. So... breaking it down below:

  • Feedforward neural network:
    No recurrence or loops—just data flowing forward.
  • Single hidden layer:
    This isn't about deep networks; even shallow ones qualify.
  • Finite number of neurons:
    In theory, you only need enough, not infinite.
  • Any continuous function:
    Think sin(x), polynomials, tanh(x), etc.
  • Compact subset of ℝⁿ:
    Basically, bounded and closed regions.
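
Written out formally, in the sigmoidal form studied by Cybenko (1989), the claim is: for any continuous function f on a compact set K ⊂ ℝⁿ and any tolerance ε > 0, there exist a finite width N, weights w_i, biases b_i, and output coefficients α_i such that

F(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^\top x + b_i), \qquad \sup_{x \in K} |F(x) - f(x)| < \varepsilon,

where σ is the non-linear activation (sigmoidal in Cybenko’s version; Hornik (1991) extends the result to broader classes of activations).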

Function as a Shape

The UAT says that a neural network, if shaped just right, can mold itself to match the contour of any smooth hill, curve, or valley. It’s like neural clay—flexible enough to copy anything, with the right sculptor (training) and tools (activation, architecture).

This doesn’t mean the approximation is always good or easy. The UAT guarantees existence, not efficiency. But it provides a profound insight: we don’t need new architectures to learn new tasks—we need new data and training techniques.

The Approximator in Practice: The MLP

The classical universal approximator is the multilayer perceptron (MLP)—a network with:

  • Input layer for features
  • One or more hidden layers with non-linear activations (sigmoid, ReLU, tanh)
  • Output layer that produces a scalar or vector

It’s the simplest and oldest form of a neural network—and it’s more powerful than it looks.
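
As a minimal sketch (in PyTorch, which the demo further down also uses), the single-hidden-layer form the theorem actually talks about can be written like this; the hidden width of 64 is an arbitrary choice, standing in for the theorem’s “finite number of neurons”:

import torch.nn as nn

# A single hidden layer with a non-linear activation: the architecture the UAT covers.
shallow_mlp = nn.Sequential(
    nn.Linear(1, 64),   # input layer: 1 feature in, 64 hidden units
    nn.Tanh(),          # non-linear activation
    nn.Linear(64, 1)    # output layer: scalar prediction
)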

Requirements for Approximation

  • Non-linear activation
    Without this, the network is just a stack of linear maps, which collapses into a single linear map (see the sketch right after this list).
  • Sufficient neurons
    More neurons = finer approximations.
  • Training
    The right weights need to be learned through optimization.
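
To make the first requirement concrete, here is a minimal sketch (again in PyTorch) showing that two stacked Linear layers with no activation between them reduce to a single affine map:

import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(1, 32)
lin2 = nn.Linear(32, 1)
x = torch.randn(5, 1)

# Passing data through both layers with no activation in between...
stacked = lin2(lin1(x))

# ...gives exactly the same result as one equivalent affine map y = x @ W.T + b
W = lin2.weight @ lin1.weight              # shape (1, 1)
b = lin2.weight @ lin1.bias + lin2.bias    # shape (1,)
collapsed = x @ W.T + b

print(torch.allclose(stacked, collapsed, atol=1e-6))  # True

No matter how many linear layers you stack, the result stays linear; the non-linearity is what buys the flexibility the theorem relies on.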

Short Demo: Approximating sin(x)

Below is an approximation of sin(x) on the interval [0, 2π] using a small MLP. This tiny network captures the wave pattern accurately, demonstrating the UAT in action.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# Generate training data: 200 points of sin(x) on [0, 2*pi]
x = torch.linspace(0, 2 * np.pi, 200).unsqueeze(1)
y = torch.sin(x)

# Define MLP: 1 -> 32 -> 32 -> 1 with tanh activations
model = nn.Sequential(
    nn.Linear(1, 32),
    nn.Tanh(),
    nn.Linear(32, 32),
    nn.Tanh(),
    nn.Linear(32, 1)
)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train with full-batch gradient descent for 1000 steps
for epoch in range(1000):
    pred = model(x)
    loss = criterion(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Plot the true function against the learned approximation
with torch.no_grad():
    y_pred = model(x)
plt.plot(x.numpy(), y.numpy(), label='True')
plt.plot(x.numpy(), y_pred.numpy(), label='Predicted')
plt.legend()
plt.title("Approximating sin(x) with MLP")
plt.show()

What the Theorem Does Not Guarantee

Understanding what the UAT doesn’t say is just as important as understanding what it does. This is where deeper architectures, regularization, and better data become crucial.

  1. It doesn’t say the network will learn the function
    Only that some configuration of weights exists that can do it.
  2. It doesn’t specify the number of neurons required
    For complex functions, this may be very large.
  3. It doesn’t address generalization
    A network might approximate a training set well and still fail on new data.
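
To see point 3 (and the theorem’s “compact subset” clause) in action, we can continue directly from the demo above, where the model was trained only on [0, 2π], and evaluate it outside that interval. The exact numbers vary from run to run, but the error outside the training region is typically far larger:

# Continues from the demo above: model, x, torch, and np are already defined
x_outside = torch.linspace(2 * np.pi, 4 * np.pi, 200).unsqueeze(1)

with torch.no_grad():
    mse_inside = ((model(x) - torch.sin(x)) ** 2).mean().item()
    mse_outside = ((model(x_outside) - torch.sin(x_outside)) ** 2).mean().item()

print(f"MSE on the training interval:  {mse_inside:.5f}")
print(f"MSE outside the training set:  {mse_outside:.5f}")  # typically orders of magnitude larger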

Deep vs Shallow Approximation

Although the UAT shows that a single hidden layer is enough, deep networks often perform better in practice. Why?

  • Depth enables compositional learning
    Layers capture hierarchies.
  • Parameter efficiency
    Deep networks may use fewer neurons overall.
  • Better generalization
    With the right constraints, depth encourages smoother interpolation.

Applications of the Universal Approximator

The UAT underlies major innovations:

  • Function regression
    Predicting any numerical output
  • Signal approximation
    Denoising audio, reconstructing signals
  • Neural style transfer
    Approximating aesthetic transforms
  • Physics-informed neural networks (PINNs)
    Solving PDEs via approximation

Power, With Limits

The Universal Approximation Theorem tells us something profound: neural networks are capable of immense representational power. But power doesn’t equal wisdom. The art of machine learning is guiding that power toward meaningful generalization, not just memorization.


References

  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems.
  • Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks.
  • Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press.