What are activation functions in neural networks, what kinds of activations are there, when should you use which, and why do they matter?
What are activation functions?
In the context of neural networks, an activation function is a mathematical operation applied to a neuron's output before passing it to the next layer. It introduces non-linearity into the model, allowing the network to learn complex patterns in the data.
Why Non-Linearity Matters
If you stacked multiple linear layers without any non-linear activations, the entire network would still collapse into a single linear transformation. In other words, no matter how many layers you add, the network could only ever represent a single linear map, i.e. straight decision boundaries.
Activation functions introduce non-linearities between layers, which lets the network approximate arbitrarily complex functions; that is the core idea behind deep learning. With enough non-linearities and data, neural nets can model intricate decision surfaces for speech, images, or language. The short code sketch after the example list below makes the collapse concrete.
A simple example:
- A network with no activations can only model a line.
- A network with ReLU activations can model piecewise linear shapes.
- With sigmoid/tanh or smooth activations, it can model curves.
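A minimal NumPy sketch of that collapse (matrix shapes are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                   # a small batch of 5 inputs with 4 features

# Two stacked linear "layers" without activations collapse into one matrix.
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
two_layers = (x @ W1) @ W2
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))     # True: the same linear map

# Inserting a ReLU between the layers breaks the collapse.
hidden = np.maximum(x @ W1, 0.0)              # element-wise ReLU
print(np.allclose(hidden @ W2, one_layer))    # False (in general)
```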
Common Activation Functions
Each activation has its own properties, advantages, and drawbacks. Below is a list of the most commonly used functions.
Why is the derivative important here?
The derivative of the activation function is crucial for the backpropagation algorithm, which is used to train neural networks.
During backpropagation, the chain rule is used to compute the gradient of the loss with respect to every weight; an optimizer such as stochastic gradient descent (SGD) then uses these gradients to update the weights. The derivative of the activation function determines how much influence a neuron's output has on the loss.
Shameless self-plug: I talk about that more here.
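To make this concrete, here is a tiny hand-written example of the chain rule flowing through a single sigmoid neuron, followed by one SGD-style update (all names and numbers are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron with a sigmoid activation: y = sigmoid(w*x + b), loss L = (y - t)^2
x, w, b, t = 2.0, 0.5, -1.0, 1.0
z = w * x + b
y = sigmoid(z)

# Backpropagation = chain rule; the activation's derivative sits in the middle.
dL_dy = 2.0 * (y - t)
dy_dz = sigmoid(z) * (1.0 - sigmoid(z))   # sigmoid'(z)
dL_dw = dL_dy * dy_dz * x                 # dz/dw = x
dL_db = dL_dy * dy_dz                     # dz/db = 1

# An optimizer (here plain SGD) then uses these gradients to update the weights.
lr = 0.1
w, b = w - lr * dL_dw, b - lr * dL_db
```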
Sigmoid
- It: Squashes numbers into a range between 0 and 1.
- Formula: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
- Good for: Binary classification tasks.
- Used for: Probability predictions.
- Bad for: Deep networks (can cause vanishing gradients).
- Range: $(0, 1)$
- Order of continuity: $C^\infty$
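A minimal NumPy sketch of the function and its derivative (helper names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))             # squashes into (0, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                        # at most 0.25, which is why gradients can vanish

print(sigmoid(np.array([-4.0, 0.0, 4.0])))      # ~[0.018, 0.5, 0.982]
```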
Tanh (Hyperbolic Tangent)
- It: Squashes numbers into a range between -1 and 1.
- Formula: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
- Derivative: $\tanh'(x) = 1 - \tanh^2(x)$
- Good for: Centering data around zero.
- Used for: Sentiment analysis, time series prediction.
- Bad for: Deep networks (can also cause vanishing gradients).
- Range: $(-1, 1)$
- Order of continuity: $C^\infty$
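In NumPy, leaning on the built-in np.tanh:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)                 # zero-centered, range (-1, 1)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # equals 1 at x = 0, decays toward 0 for large |x|
```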
ReLU (Rectified Linear Unit)
- It: Outputs the input directly if positive; otherwise, it outputs zero.
- Formula: $\mathrm{ReLU}(x) = \max(0, x)$
- Derivative: $\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$ (undefined at $x = 0$, conventionally set to 0)
- Good for: Most hidden layers in deep networks (efficient and effective).
- Used for: Image recognition, NLP tasks.
- Bad for: Can lead to "dying ReLUs" where neurons output zero for all inputs.
- Range: $[0, \infty)$
- Order of continuity: $C^0$
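In NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 1 for x > 0, 0 elsewhere (the kink at x = 0 is conventionally set to 0)
    return (x > 0).astype(float)
```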
Leaky ReLU
- It: Similar to ReLU but gives a small negative slope for negative inputs.
- Formula: $f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}$ with a small constant $\alpha$ (commonly 0.01)
- Derivative: $f'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x < 0 \end{cases}$
- Good for: Preventing dying ReLUs.
- Used for: Fraud detection, anomaly detection.
- Bad for: Still not perfect. Can be less interpretable.
- Range: $(-\infty, \infty)$
- Order of continuity: $C^0$
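A NumPy sketch with the usual default slope of 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope keeps negative units alive

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)
```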
PReLU (Parametric ReLU)
- It: An extension of Leaky ReLU where the slope for negative inputs is learned during training.
- Formula: $f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}$ where $\alpha$ is a learned parameter.
- Derivative: $f'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x < 0 \end{cases}$
- Note: $\alpha$ can be learned per neuron or shared per layer, depending on the implementation.
- Good for: Allowing the model to adaptively learn the best activation.
- Used for: Image recognition, deep networks.
- Bad for: More parameters to learn, which can lead to overfitting.
- Range: $(-\infty, \infty)$
- Order of continuity: $C^0$
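A NumPy sketch; in a real framework $\alpha$ would be a trainable tensor, here it is simply an argument:

```python
import numpy as np

def prelu(x, alpha):
    # alpha is learned during training (per neuron or per layer); passed in explicitly here
    return np.where(x > 0, x, alpha * x)

def prelu_grads(x, alpha):
    dx = np.where(x > 0, 1.0, alpha)       # gradient w.r.t. the input
    dalpha = np.where(x > 0, 0.0, x)       # gradient w.r.t. alpha, used to update it
    return dx, dalpha
```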
GELU (Gaussian Error Linear Unit)
- It: Combines properties of ReLU and sigmoid, providing a smooth activation.
- Formula: $\mathrm{GELU}(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the standard normal cumulative distribution function (CDF).
- Derivative: $\Phi(x) + x\,\phi(x)$, where $\phi(x)$ is the standard normal PDF.
- Note: Commonly computed automatically by frameworks, often via a tanh approximation.
- Good for: Deep learning models, especially transformers.
- Used for: NLP tasks, image recognition.
- Bad for: More complex to compute than ReLU (though efficient in modern libraries).
- Range: approximately $[-0.17, \infty)$
- Order of continuity: $C^\infty$
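A sketch of the exact form and the common tanh approximation (assuming SciPy is available for the error function):

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    # Exact form: x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh_approx(x):
    # Tanh approximation that many frameworks use for speed
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```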
Softmax
- It: Converts a vector of values into probabilities that sum to 1.
- Formula: $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$
- Derivative: $\frac{\partial\, \mathrm{softmax}(x)_i}{\partial x_j} = \mathrm{softmax}(x)_i \left(\delta_{ij} - \mathrm{softmax}(x)_j\right)$
- Note: Involves the Jacobian matrix, since every output depends on every input.
- Good for: Multi-class classification tasks.
- Used for: Image classification, language modeling.
- Bad for: Regression tasks (outputs are forced to sum to 1).
- Range: $(0, 1)$ for each output, sums to 1
- Order of continuity: $C^\infty$
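A numerically stable NumPy sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                # ~[0.659, 0.242, 0.099], sums to 1.0
```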
Softplus
- It: A smooth approximation of ReLU.
- Formula: $\mathrm{softplus}(x) = \ln(1 + e^{x})$
- Derivative: $\mathrm{softplus}'(x) = \frac{1}{1 + e^{-x}}$
- Note: The derivative is the sigmoid function.
- Good for: Situations where a smooth function is preferred.
- Used for: Regression tasks, probabilistic models.
- Bad for: Can be slower to compute than ReLU.
- Range: $(0, \infty)$
- Order of continuity: $C^\infty$
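A NumPy sketch using np.logaddexp to avoid overflow for large inputs:

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)        # log(1 + exp(x)) without overflow for large x

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))    # the sigmoid function
```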
Swish / SiLU (Sigmoid Linear Unit)
- It: A smooth, non-monotonic function that can outperform ReLU in some cases.
- Formula: $\mathrm{swish}(x) = x \cdot \sigma(x)$
- Derivative: $\mathrm{swish}'(x) = \sigma(x) + x\,\sigma(x)\,(1 - \sigma(x))$
- Good for: Deep networks where ReLU might struggle.
- Used for: Image classification, NLP tasks.
- Bad for: More computationally expensive than ReLU.
- Range: approximately $[-0.28, \infty)$
- Order of continuity: $C^\infty$
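A NumPy sketch using the SiLU form:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))      # x * sigmoid(x)

def silu_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s + x * s * (1.0 - s)       # sigmoid(x) + x * sigmoid'(x)
```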
Gaussian
- It: A bell-shaped curve that is often used in statistics and machine learning.
- Formula: $f(x) = e^{-x^2}$
- Derivative: $f'(x) = -2x\,e^{-x^2}$
- Good for: Modeling continuous data, anomaly detection.
- Used for: Probabilistic models, density estimation.
- Bad for: Rarely used as an activation in modern deep learning architectures.
- Range: $(0, 1]$
- Order of continuity: $C^\infty$
Gaussian functions are typically used in radial basis function (RBF) networks or probabilistic models,
not as standard activations in deep networks like CNNs or transformers.
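For completeness, a NumPy sketch:

```python
import numpy as np

def gaussian(x):
    return np.exp(-x ** 2)             # peaks at 1 for x = 0, decays toward 0

def gaussian_grad(x):
    return -2.0 * x * np.exp(-x ** 2)
```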
Choosing the Right Activation Function
| Task / Layer | Recommended Activation | Reason | Performance |
|---|---|---|---|
| Hidden layers (most tasks) | ReLU / Leaky ReLU | Simple, fast, effective | Very fast |
| Deep NLP / Transformers | GELU / Swish | Smooth, better gradient flow | Slightly slower but optimized |
| Output for binary classification | Sigmoid | Converts to probability | Efficient |
| Output for multi-class classification | Softmax | Normalized class probabilities | Standard |
| Regression outputs | Linear (no activation) | Unbounded range | Fastest |
Note:
Modern frameworks optimize GELU, Swish, and Softplus well, so their extra compute cost is usually negligible in practice.
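As a rough illustration of how the table's recommendations translate into code, here is a hypothetical PyTorch sketch (layer sizes are arbitrary placeholders):

```python
import torch.nn as nn

# Hidden layers use ReLU; a binary-classification head ends in Sigmoid.
binary_classifier = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

# A transformer-style feed-forward block typically uses GELU.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# A regression head keeps the last layer linear (no activation).
regressor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
```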
Common Pitfalls
- Vanishing gradients: Sigmoid and tanh saturate for large positive or negative inputs, so their gradients shrink toward zero and can vanish in deep networks.
- Dying ReLUs: Large learning rates can push neurons permanently into the negative regime, making them output 0 forever.
- Exploding outputs: Unbounded activations or a misused softmax (e.g. applied to very large logits without stabilization) can lead to numerical instability.
- Misusing softmax: Applying softmax to regression outputs or applying it twice can distort predictions.
- Overfitting with PReLU: Learning too many extra parameters can overfit small datasets.