Batch Normalization

Understanding Batch Normalization, with the Code Explained

May 22, 2022

What is Batch Normalization?

Normalization means rescaling the data dimensions so that they are of approximately the same scale. Batch normalization is a technique for training very deep neural networks that normalizes the inputs to a layer for every mini-batch. Batch Norm is an essential part of most deep learning implementations.

Batch Normalization normalizes the activation values of the hidden units so that the distribution of these activations stays roughly the same during training. This stabilizes the learning process and can substantially reduce the number of training epochs required to train deep neural networks.

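We will assume that X is of size [N x D], where N is the number of data points and D is their dimensionality. For each feature (column), we compute the mean over the batch and subtract it from the data. The variance is then the mean of the squared centered values, and dividing the centered data by the standard deviation gives the normalized values.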
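
As a minimal sketch of these steps, the NumPy snippet below normalizes a small made-up batch column by column. The array values and the constant eps are illustrative assumptions, not part of the original code.

```python
import numpy as np

# Hypothetical batch: N = 4 samples, D = 3 features on very different scales.
X = np.array([[1.0, 100.0, 0.1],
              [2.0, 200.0, 0.2],
              [3.0, 300.0, 0.3],
              [4.0, 400.0, 0.4]])

eps = 1e-6  # small constant (an assumption) to avoid division by zero

mu = X.mean(axis=0)              # per-feature mean, shape (D,)
xc = X - mu                      # centered data, shape (N, D)
var = np.mean(xc ** 2, axis=0)   # per-feature variance over the batch, shape (D,)
xn = xc / np.sqrt(var + eps)     # normalized data, shape (N, D)

print(xn.mean(axis=0))  # each entry is close to 0
print(xn.std(axis=0))   # each entry is close to 1
```

After this step all three features live on the same scale, regardless of their original units.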

Gamma and Beta are learned along with the other parameters of the network. They scale and shift the normalized values, and each has its own update rule, which depends on the derivative of the loss function with respect to gamma and beta.
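
In symbols, the normalization and the role of gamma and beta can be written as below. The notation is an assumption introduced here for illustration: x_{ij} is the entry of X in row i and column j, \epsilon is a small constant, L is the loss, and \eta is a learning rate. The last line shows plain gradient descent only as an example; the actual update depends on the optimizer used.

```latex
\begin{aligned}
\mu_j &= \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad
\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{ij}-\mu_j)^2, \\
\hat{x}_{ij} &= \frac{x_{ij}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}}, \qquad
y_{ij} = \gamma_j\,\hat{x}_{ij} + \beta_j, \\
\frac{\partial L}{\partial \gamma_j} &= \sum_{i=1}^{N} \frac{\partial L}{\partial y_{ij}}\,\hat{x}_{ij}, \qquad
\frac{\partial L}{\partial \beta_j} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_{ij}}, \\
\gamma_j &\leftarrow \gamma_j - \eta\,\frac{\partial L}{\partial \gamma_j}, \qquad
\beta_j \leftarrow \beta_j - \eta\,\frac{\partial L}{\partial \beta_j}.
\end{aligned}
```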

Advantages of Batch Normalization

Here are some benefits of Batch Normalization:

The model is less sensitive to hyperparameter tuning. In particular, larger learning rates that previously led to unusable models become acceptable
Reduces internal covariate shift
Reduces the dependence of gradients on the scale of the parameters or their initial values
Weight initialization becomes somewhat less important
Batch normalization acts as a regularizer, so Dropout can sometimes be reduced or removed
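
One of these points, the reduced dependence on the scale of the parameters, can be illustrated with a small sketch. The helper batch_norm, the array shapes, and the scale factor of 100 are hypothetical choices made for this example; gamma and beta are left out to keep it short.

```python
import numpy as np

def batch_norm(z, eps=1e-6):
    """Normalize each column of z to zero mean and unit variance (no gamma/beta)."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))   # hypothetical mini-batch
W = rng.normal(size=(10, 5))    # hypothetical layer weights

out_small = batch_norm(x @ W)
out_large = batch_norm(x @ (100.0 * W))   # same weights, scaled by 100

# The difference is tiny: only eps prevents the two outputs from being identical.
print(np.abs(out_small - out_large).max())
```

Because the normalized output barely changes when the weights are rescaled, the layers that follow see almost the same activations either way.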

Batch Normalization Layer

Batch Normalization normalizes batch data (the output of an Affine or Conv layer) before the activation. A Batch Normalization layer consists of a forward pass and a backward pass. The code is from the book ‘Deep Learning From Scratch’, published in September 2019.

Initialization of Batch Normalization Layer

The temporary variables used by forward() and backward() are declared in the initializer.

import numpy as np


class BatchNormalization:

    def __init__(self, gamma, beta, momentum=0.9, running_mean=None, running_var=None):
        self.gamma = gamma # scale applied after normalization
        self.beta = beta # shift applied after normalization
        self.momentum = momentum # EMA decay: weight kept on the previous running value

        # temporary variable set in forward()
        self.input_shape = None # Conv: 4 dims, Affine: 2 dims

        # running mean and variance, used at test time
        self.running_mean = running_mean
        self.running_var = running_var

        # intermediate values saved in forward() for use in backward()
        self.batch_size = None
        self.xc = None # centered batch data
        self.xn = None # normalized batch data
        self.std = None
        self.dgamma = None
        self.dbeta = None

Forward Pass of Batch Normalization Layer

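The forward pass has two parts. The first, forward(), flattens the input to 2 dimensions when needed and restores the original shape afterwards. The second, __forward(), performs the actual batch normalization.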

    def forward(self, x, train_flg=True): #train_flg=False when testing 
        self.input_shape = x.shape # hold original shape here
        if x.ndim != 2: # If the previous layer is not Affine(ndim=2)
            N, C, H, W = x.shape
            x = x.reshape(N, -1) 
        out = self.__forward(x, train_flg)
        return out.reshape(*self.input_shape) # recover the shape of x after normalization

    def __forward(self, x, train_flg):
        if self.running_mean is None:
            N, D = x.shape # on the first call, initialize the running mean and variance to 0
            self.running_mean = np.zeros(D)
            self.running_var = np.zeros(D)
                        
        if train_flg: # When training
            mu = x.mean(axis=0) # find mean for each column
            xc = x - mu # mean subtraction 
            var = np.mean(xc**2, axis=0)
            std = np.sqrt(var + 10e-7) # 10e-7 (i.e. 1e-6) keeps std away from 0 to avoid division by zero
            xn = xc / std #normalize
            
            self.batch_size = x.shape[0]
            self.xc = xc
            self.xn = xn
            self.std = std
            self.running_mean = self.momentum * self.running_mean + (1-self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1-self.momentum) * var            
        else: # When testing
            xc = x - self.running_mean
            xn = xc / ((np.sqrt(self.running_var + 10e-7)))
            
        out = self.gamma * xn + self.beta # scale and shift with the learned gamma and beta
        return out
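
To see how the running statistics behave, the standalone sketch below repeats only the moving-average update from __forward on made-up data. The data distribution, the batch size, and the number of steps are assumptions chosen for illustration.

```python
import numpy as np

momentum = 0.9                  # same default as in the layer above
rng = np.random.default_rng(0)

running_mean = np.zeros(3)      # initialized to 0, as in __forward
running_var = np.zeros(3)

# Hypothetical training data: 3 features with true mean 5 and variance 4.
for step in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, 3))
    mu = batch.mean(axis=0)
    var = batch.var(axis=0)
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

print(running_mean)  # approaches the true mean of 5
print(running_var)   # approaches the true variance of about 4

# At test time even a single sample can be normalized with the running statistics.
x_test = np.array([[5.0, 7.0, 3.0]])
xn_test = (x_test - running_mean) / np.sqrt(running_var + 10e-7)
print(xn_test)
```

Because the running values start at 0, they are biased toward 0 during the first few updates. With momentum=0.9 that influence fades after a few dozen batches.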

Backward Pass of Batch Normalization Layer

The backward pass computes dgamma and dbeta, which are used to update gamma and beta, and returns dx, the gradient that is passed back to the previous layer.

    def backward(self, dout):
        if dout.ndim != 2:
            N, C, H, W = dout.shape
            dout = dout.reshape(N, -1)

        dx = self.__backward(dout)

        dx = dx.reshape(*self.input_shape)
        return dx

    def __backward(self, dout):
        dbeta = dout.sum(axis=0)
        dgamma = np.sum(self.xn * dout, axis=0) # Hadamard product
        dxn = self.gamma * dout
        dxc = dxn / self.std 
        dstd = -np.sum((dxn * self.xc) / (self.std * self.std), axis=0)
        dvar = 0.5 * dstd / self.std
        dxc += (2.0 / self.batch_size) * self.xc * dvar
        dmu = np.sum(dxc, axis=0)
        dx = dxc - dmu / self.batch_size
        
        self.dgamma = dgamma
        self.dbeta = dbeta
        
        return dx
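
A common way to gain confidence in a hand-written backward pass is a numerical gradient check. The sketch below is one possible version. The functions bn_forward, bn_backward, and numerical_grad are hypothetical standalone rewrites of the methods above (training mode only), and the loss is a made-up weighted sum of the outputs.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=10e-7):
    mu = x.mean(axis=0)
    xc = x - mu
    var = np.mean(xc ** 2, axis=0)
    std = np.sqrt(var + eps)
    xn = xc / std
    return gamma * xn + beta, (xc, xn, std)

def bn_backward(dout, gamma, cache):
    xc, xn, std = cache
    n = dout.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = np.sum(xn * dout, axis=0)
    dxn = gamma * dout
    dxc = dxn / std
    dstd = -np.sum((dxn * xc) / (std * std), axis=0)
    dvar = 0.5 * dstd / std
    dxc = dxc + (2.0 / n) * xc * dvar
    dmu = np.sum(dxc, axis=0)
    dx = dxc - dmu / n
    return dx, dgamma, dbeta

def numerical_grad(f, arr, h=1e-5):
    """Central-difference gradient of the scalar f() with respect to arr (perturbed in place)."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        orig = arr[idx]
        arr[idx] = orig + h
        f_plus = f()
        arr[idx] = orig - h
        f_minus = f()
        arr[idx] = orig          # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
dout = rng.normal(size=(5, 3))   # upstream gradient of the made-up loss

def loss():
    out, _ = bn_forward(x, gamma, beta)
    return np.sum(out * dout)

out, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dout, gamma, cache)

# Each maximum difference should be very small if the backward pass is correct.
print(np.abs(dx - numerical_grad(loss, x)).max())
print(np.abs(dgamma - numerical_grad(loss, gamma)).max())
print(np.abs(dbeta - numerical_grad(loss, beta)).max())
```

If the analytic and numerical gradients disagree by more than a small tolerance, the backward pass most likely contains a mistake.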

Batch Norm is a very useful layer. If you are interested in Deep Learning, you will certainly need to become familiar with this method. I hope this post gives you a good understanding of how Batch Norm works.

Riley Learning
