Batch Normalization in Neural Networks

Pranav Srivastava
4 min read · May 25, 2020

Batch normalization is a technique for overcoming the problem of internal covariate shift when training a deep neural network with mini-batches. It was introduced in 2015 by Sergey Ioffe and Christian Szegedy [1]. To understand batch normalization in detail, we first have to understand three terms: normalization, mini-batches and covariate shift. Each is described below.

a) Normalization — In general terms, normalization is a preprocessing step that standardizes the data before it is fed to a neural network for training. In other words, normalization puts all data points on the same scale.

Why do we apply normalization?

If the values within a feature, or the values across different features, are on very different scales, training becomes unstable: the large-scale features dominate the loss, which leads to imbalanced gradients in the gradient descent step.
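As a concrete illustration, here is a minimal NumPy sketch of standardization; the toy matrix and variable names are assumptions for the example, not from the article:

```python
import numpy as np

# Minimal sketch of feature-wise standardization. The toy matrix is an
# illustrative assumption: two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mean = X.mean(axis=0)          # per-feature mean
std = X.std(axis=0)            # per-feature standard deviation
X_norm = (X - mean) / std      # every feature now has mean 0, std 1

print(X_norm.mean(axis=0))     # ~[0. 0.]
print(X_norm.std(axis=0))      # ~[1. 1.]
```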

b) Mini-batches — These are small subsets of the training data that are fed to the neural network one at a time during training.

Why use mini-batches?

When the entire dataset is fed to a deep neural network at once, the weights and biases are updated only once per epoch, which makes training very slow. It also becomes hard to explore alternative model architectures, because all the time spent training on one large batch produces only a single parameter update per pass. To overcome this, the training data is divided into multiple mini-batches (typical sizes are 32, 64, 128 and so on, depending on the volume of data), and the weights and biases are updated once per mini-batch. Within one epoch (1 epoch occurs when we have pushed all the training data through the network), the network therefore receives as many updates as there are mini-batches, which makes learning significantly faster. Model analysis also becomes easier, since each epoch now explores many parameter updates instead of one.
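The sketch below illustrates this loop with a toy linear model in NumPy; all names, sizes and the learning rate are illustrative assumptions. The point is that the parameters are updated once per mini-batch, so one epoch yields many updates:

```python
import numpy as np

# Toy mini-batch gradient descent on a linear model: weights are updated
# once per mini-batch, so one epoch yields len(X) / batch_size updates
# instead of a single full-batch update.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))        # 1024 samples, 10 features
y = rng.normal(size=(1024, 1))
w = np.zeros((10, 1))                  # weights of a single linear layer
batch_size, lr = 64, 0.01

indices = rng.permutation(len(X))      # shuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]
    X_b, y_b = X[batch], y[batch]
    grad = 2 * X_b.T @ (X_b @ w - y_b) / len(X_b)   # MSE gradient
    w -= lr * grad                     # one parameter update per mini-batch

print(len(X) // batch_size, "updates in this epoch")   # 16, not 1
```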

c) Covariate shift — A change in the distribution of the inputs to the deep layers of a neural network is called internal covariate shift [2]. The change in distribution arises during mini-batch training: as the parameters of the earlier layers are updated on each mini-batch, the distribution of the activations they pass to later layers keeps shifting. In other words, let’s say during training one weight becomes drastically larger than the other weights; its corresponding neuron then produces a much larger output, and this imbalance cascades through the deeper layers of the network, eventually making the network unstable.

In the example below, the weight w = 40 is drastically larger than the other weights. It causes the output of its corresponding neuron to be extremely large, and the imbalance cascades further through the network, resulting in internal covariate shift. This problem can occur at any layer of the network.

Example of internal covariate shift when using mini-batches
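To make the cascading effect concrete, the toy NumPy sketch below (all weights and sizes are illustrative assumptions) plants one oversized weight in the first layer and shows how the resulting spread in activations propagates to the next layer:

```python
import numpy as np

# Toy demonstration of the cascade: one oversized weight inflates the
# spread of its neuron's output, and the imbalance propagates onward.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))            # a mini-batch of inputs

W1 = rng.normal(size=(4, 4)) * 0.1      # well-scaled weights...
W1[0, 0] = 40.0                         # ...except one drastic outlier
W2 = rng.normal(size=(4, 4)) * 0.1

h1 = np.maximum(0, x @ W1)              # first hidden layer (ReLU)
h2 = np.maximum(0, h1 @ W2)             # second hidden layer

print(h1.std(axis=0))   # the first activation's spread dwarfs the rest
print(h2.std(axis=0))   # the imbalance has propagated a layer deeper
```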

Batch normalization standardizes the inputs to a layer, and we can choose which layers to apply it to.

How to apply batch normalization?

To apply this technique, we first calculate the mean (m) and standard deviation (s) of each input (x) to a layer over the current mini-batch. The normalized value (x̂) is then computed using the formula below.

x̂ = (x - m) / s

Note — simply normalizing each input of a layer can change what the layer is able to represent. It is important that the transformation inserted into the network can represent the identity transform (a transformation that copies the source data to the destination unchanged). To achieve this, a pair of parameters is used to shift and scale the normalized value. In other words, if scale and shift were not applied, the normalized outputs would mostly stay close to 0, which would hinder the network’s ability to fully utilize its non-linear transformations.

Next, scale and shift are applied using the parameters γ and β to obtain the batch-normalized output (y), using the formula below. γ and β are learned along with the original model parameters.

y = γ x̂ + β
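Putting the two formulas together, here is a minimal NumPy sketch of the forward pass of the batch normalizing transform. The shapes, the function name, and eps (a small constant added to the variance for numerical stability, as in the paper) are assumptions for illustration:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Forward pass of the batch normalizing transform (Algorithm 1 in [1]).

    eps is a small constant added to the variance for numerical stability,
    as in the paper. Function and variable names are assumptions.
    """
    m = x.mean(axis=0)                    # mini-batch mean per activation
    var = x.var(axis=0)                   # mini-batch variance per activation
    x_hat = (x - m) / np.sqrt(var + eps)  # normalize: x̂ = (x - m) / s
    return gamma * x_hat + beta           # scale and shift: y = γ x̂ + β

x = np.random.randn(64, 128)   # mini-batch of 64 examples, 128 activations
gamma = np.ones(128)           # learned scale, initialized to 1
beta = np.zeros(128)           # learned shift, initialized to 0
y = batch_norm_forward(x, gamma, beta)
print(y.shape)                 # (64, 128)
```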

The steps explained above can also be seen in the snippet below from the paper [1].

Batch Normalizing Transform applied to activation x over a mini-batch. [1]

Since mini-batches are used in stochastic gradient training, each mini-batch produces its own mean and variance for each activation.
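In practice, deep learning frameworks provide batch normalization as a ready-made layer that maintains these per-mini-batch statistics for us. Below is a sketch using PyTorch’s nn.BatchNorm1d; the layer sizes (784, 256, 64, 10) are arbitrary assumptions for illustration:

```python
import torch
import torch.nn as nn

# Sketch of choosing where to apply batch normalization in practice,
# using PyTorch's built-in nn.BatchNorm1d layer.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalize the inputs to the next layer
    nn.ReLU(),
    nn.Linear(256, 64),
    nn.BatchNorm1d(64),    # we choose which layers get batch norm
    nn.ReLU(),
    nn.Linear(64, 10),
)

out = model(torch.randn(32, 784))   # a mini-batch of 32 examples
print(out.shape)                    # torch.Size([32, 10])
```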

Other advantages

Batch normalization addresses the problem of internal covariate shift. In addition, the technique offers the advantages below:

a) It makes the network more stable, so higher learning rates can be chosen, which in turn speeds up the learning process.

b) It reduces the need for dropout regularization, since batch normalization itself provides a slight regularization effect.

c) It makes deep neural networks less sensitive to the choice of initial weights, since the network becomes considerably more stable once batch normalization is applied.

References:

[1] https://arxiv.org/pdf/1502.03167.pdf

[2] https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/

[3] https://ieeexplore.ieee.org/document/8079081
