What are activation functions in neural networks, what kinds of activations are there, when should you use which, and why do they matter?
What are activation functions?
In the context of neural networks, an activation function is a mathematical operation applied to a neuron's output before passing it to the next layer. It introduces non-linearity into the model, allowing the network to learn complex patterns in the data.
Why Non-Linearity Matters
If you stacked multiple linear layers without any non-linear activations, the entire network would still collapse into a single linear transformation. In other words, no matter how many layers you add, the network could only ever represent a single linear map, i.e. straight decision boundaries.
Activation functions introduce non-linearities between layers, which lets the network approximate arbitrarily complex functions; that is the core idea behind deep learning. With enough non-linearities and data, neural nets can model intricate decision surfaces for speech, images, or language. The short code sketch after the example list below makes the collapse concrete.
A simple example:
- A network with no activations can only model a line.
- A network with ReLU activations can model piecewise linear shapes.
- With sigmoid/tanh or smooth activations, it can model curves.
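A minimal NumPy sketch of that collapse (matrix shapes are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                   # a small batch of 5 inputs with 4 features

# Two stacked linear "layers" without activations collapse into one matrix.
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
two_layers = (x @ W1) @ W2
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))     # True: the same linear map

# Inserting a ReLU between the layers breaks the collapse.
hidden = np.maximum(x @ W1, 0.0)              # element-wise ReLU
print(np.allclose(hidden @ W2, one_layer))    # False (in general)
```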
Common Activation Functions
Each activation has its own properties, advantages, and drawbacks. Below is a list of the most commonly used functions.
Why is the derivative important here?
The derivative of the activation function is crucial for the backpropagation algorithm, which is used to train neural networks.
During backpropagation, the chain rule is used to compute the gradient of the loss with respect to every weight; an optimizer such as stochastic gradient descent (SGD) then uses these gradients to update the weights. The derivative of the activation function determines how much influence a neuron's output has on the loss.
Shameless self-plug: I talk about that more here.
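To make this concrete, here is a tiny hand-written example of the chain rule flowing through a single sigmoid neuron, followed by one SGD-style update (all names and numbers are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron with a sigmoid activation: y = sigmoid(w*x + b), loss L = (y - t)^2
x, w, b, t = 2.0, 0.5, -1.0, 1.0
z = w * x + b
y = sigmoid(z)

# Backpropagation = chain rule; the activation's derivative sits in the middle.
dL_dy = 2.0 * (y - t)
dy_dz = sigmoid(z) * (1.0 - sigmoid(z))   # sigmoid'(z)
dL_dw = dL_dy * dy_dz * x                 # dz/dw = x
dL_db = dL_dy * dy_dz                     # dz/db = 1

# An optimizer (here plain SGD) then uses these gradients to update the weights.
lr = 0.1
w, b = w - lr * dL_dw, b - lr * dL_db
```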
Sigmoid
- It: Squashes numbers into a range between 0 and 1.
- Formula: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
- Good for: Binary classification tasks.
- Used for: Probability predictions.
- Bad for: Deep networks (can cause vanishing gradients).
- Range: $(0, 1)$
- Order of continuity: $C^\infty$
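A minimal NumPy sketch of the function and its derivative (helper names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))             # squashes into (0, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                        # at most 0.25, which is why gradients can vanish

print(sigmoid(np.array([-4.0, 0.0, 4.0])))      # ~[0.018, 0.5, 0.982]
```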
Tanh (Hyperbolic Tangent)
- It: Squashes numbers into a range between -1 and 1.
- Formula: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
- Derivative: $\tanh'(x) = 1 - \tanh^2(x)$
- Good for: Centering data around zero.
- Used for: Sentiment analysis, time series prediction.
- Bad for: Deep networks (can also cause vanishing gradients).
- Range: $(-1, 1)$
- Order of continuity: $C^\infty$
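In NumPy, leaning on the built-in np.tanh:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)                 # zero-centered, range (-1, 1)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # equals 1 at x = 0, decays toward 0 for large |x|
```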
ReLU (Rectified Linear Unit)
- It: Outputs the input directly if positive; otherwise, it outputs zero.
- Formula: $\mathrm{ReLU}(x) = \max(0, x)$
- Derivative: $\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$ (undefined at $x = 0$, conventionally set to 0)
- Good for: Most hidden layers in deep networks (efficient and effective).
- Used for: Image recognition, NLP tasks.
- Bad for: Can lead to "dying ReLUs" where neurons output zero for all inputs.
- Range: $[0, \infty)$
- Order of continuity: $C^0$
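In NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 1 for x > 0, 0 elsewhere (the kink at x = 0 is conventionally set to 0)
    return (x > 0).astype(float)
```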
Leaky ReLU
- It: Similar to ReLU but gives a small negative slope for negative inputs.
- Formula: $f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}$ with a small constant $\alpha$ (commonly 0.01)
- Derivative: $f'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x < 0 \end{cases}$
- Good for: Preventing dying ReLUs.
- Used for: Fraud detection, anomaly detection.
- Bad for: Still not perfect. Can be less interpretable.
- Range: $(-\infty, \infty)$
- Order of continuity: $C^0$
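A NumPy sketch with the usual default slope of 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope keeps negative units alive

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)
```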
PReLU (Parametric ReLU)
- It: An extension of Leaky ReLU where the slope for negative inputs is learned during training.
- Formula: $f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}$ where $\alpha$ is a learned parameter.
- Derivative: $f'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x < 0 \end{cases}$
- Note: $\alpha$ can be learned per neuron or shared per layer, depending on the implementation.
- Good for: Allowing the model to adaptively learn the best activation.
- Used for: Image recognition, deep networks.
- Bad for: More parameters to learn, which can lead to overfitting.
- Range: $(-\infty, \infty)$
- Order of continuity: $C^0$
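A NumPy sketch; in a real framework $\alpha$ would be a trainable tensor, here it is simply an argument:

```python
import numpy as np

def prelu(x, alpha):
    # alpha is learned during training (per neuron or per layer); passed in explicitly here
    return np.where(x > 0, x, alpha * x)

def prelu_grads(x, alpha):
    dx = np.where(x > 0, 1.0, alpha)       # gradient w.r.t. the input
    dalpha = np.where(x > 0, 0.0, x)       # gradient w.r.t. alpha, used to update it
    return dx, dalpha
```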
GELU (Gaussian Error Linear Unit)
- It: Combines properties of ReLU and sigmoid, providing a smooth activation.
- Formula: $\mathrm{GELU}(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the standard normal cumulative distribution function (CDF).
- Derivative: $\Phi(x) + x\,\phi(x)$, where $\phi(x)$ is the standard normal PDF.
- Note: Commonly computed automatically by frameworks, often via a tanh approximation.
- Good for: Deep learning models, especially transformers.
- Used for: NLP tasks, image recognition.
- Bad for: More complex to compute than ReLU (though efficient in modern libraries).
- Range: approximately $[-0.17, \infty)$
- Order of continuity: $C^\infty$
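A sketch of the exact form and the common tanh approximation (assuming SciPy is available for the error function):

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    # Exact form: x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh_approx(x):
    # Tanh approximation that many frameworks use for speed
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```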
Softmax
- It: Converts a vector of values into probabilities that sum to 1.
- Formula: $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$
- Derivative: $\frac{\partial\, \mathrm{softmax}(x)_i}{\partial x_j} = \mathrm{softmax}(x)_i \left(\delta_{ij} - \mathrm{softmax}(x)_j\right)$
- Note: Involves the Jacobian matrix, since every output depends on every input.
- Good for: Multi-class classification tasks.
- Used for: Image classification, language modeling.
- Bad for: Regression tasks (outputs are forced to sum to 1).
- Range: $(0, 1)$ for each output, sums to 1
- Order of continuity: $C^\infty$
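A numerically stable NumPy sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                # ~[0.659, 0.242, 0.099], sums to 1.0
```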
Softplus
- It: A smooth approximation of ReLU.
- Formula: $\mathrm{softplus}(x) = \ln(1 + e^{x})$
- Derivative: $\mathrm{softplus}'(x) = \frac{1}{1 + e^{-x}}$
- Note: The derivative is the sigmoid function.
- Good for: Situations where a smooth function is preferred.
- Used for: Regression tasks, probabilistic models.
- Bad for: Can be slower to compute than ReLU.
- Range: $(0, \infty)$
- Order of continuity: $C^\infty$
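A NumPy sketch using np.logaddexp to avoid overflow for large inputs:

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)        # log(1 + exp(x)) without overflow for large x

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))    # the sigmoid function
```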
Swish / SiLU (Sigmoid Linear Unit)
- It: A smooth, non-monotonic function that can outperform ReLU in some cases.
- Formula: $\mathrm{swish}(x) = x \cdot \sigma(x)$
- Derivative: $\mathrm{swish}'(x) = \sigma(x) + x\,\sigma(x)\,(1 - \sigma(x))$
- Good for: Deep networks where ReLU might struggle.
- Used for: Image classification, NLP tasks.
- Bad for: More computationally expensive than ReLU.
- Range: approximately $[-0.28, \infty)$
- Order of continuity: $C^\infty$
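A NumPy sketch using the SiLU form:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))      # x * sigmoid(x)

def silu_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s + x * s * (1.0 - s)       # sigmoid(x) + x * sigmoid'(x)
```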
Gaussian
- It: A bell-shaped curve that is often used in statistics and machine learning.
- Formula: $f(x) = e^{-x^2}$
- Derivative: $f'(x) = -2x\,e^{-x^2}$
- Good for: Modeling continuous data, anomaly detection.
- Used for: Probabilistic models, density estimation.
- Bad for: Rarely used as an activation in modern deep learning architectures.
- Range: $(0, 1]$
- Order of continuity: $C^\infty$
Gaussian functions are typically used in radial basis function (RBF) networks or probabilistic models,
not as standard activations in deep networks like CNNs or transformers.
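For completeness, a NumPy sketch:

```python
import numpy as np

def gaussian(x):
    return np.exp(-x ** 2)             # peaks at 1 for x = 0, decays toward 0

def gaussian_grad(x):
    return -2.0 * x * np.exp(-x ** 2)
```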
Choosing the Right Activation Function
| Task / Layer | Recommended Activation | Reason | Performance |
|---|---|---|---|
| Hidden layers (most tasks) | ReLU / Leaky ReLU | Simple, fast, effective | Very fast |
| Deep NLP / Transformers | GELU / Swish | Smooth, better gradient flow | Slightly slower but optimized |
| Output for binary classification | Sigmoid | Converts to probability | Efficient |
| Output for multi-class classification | Softmax | Normalized class probabilities | Standard |
| Regression outputs | Linear (no activation) | Unbounded range | Fastest |
Note:
Modern frameworks optimize GELU, Swish, and Softplus well, so their extra compute cost is usually negligible in practice.
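As a rough illustration of how the table's recommendations translate into code, here is a hypothetical PyTorch sketch (layer sizes are arbitrary placeholders):

```python
import torch.nn as nn

# Hidden layers use ReLU; a binary-classification head ends in Sigmoid.
binary_classifier = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

# A transformer-style feed-forward block typically uses GELU.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# A regression head keeps the last layer linear (no activation).
regressor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
```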
Common Pitfalls
- Vanishing gradients: Sigmoid and tanh saturate for large positive or negative inputs, so their gradients shrink toward zero and can vanish in deep networks.
- Dying ReLUs: Large learning rates can push neurons permanently into the negative regime, making them output 0 forever.
- Exploding outputs: Unbounded activations or a misused softmax (e.g. applied to very large logits without stabilization) can lead to numerical instability.
- Misusing softmax: Applying softmax to regression outputs or applying it twice can distort predictions.
- Overfitting with PReLU: Learning too many extra parameters can overfit small datasets.