Optimizers: SGD, Momentum, AdaGrad, Adam
An overview of gradient descent optimization algorithms
What Is an Optimizer?
Optimizers are algorithms or methods used to minimize an error function (loss function) or to make training as efficient as possible. The purpose of neural network learning is to find the parameters that make the value of the loss function as small as possible. This is the problem of finding the optimal parameters, and the process of solving it is called optimization.
The four optimizers this post covers are:
Stochastic Gradient Descent (SGD)
Momentum
AdaGrad
Adam
This post discusses and compares these optimizers. The code examples are from the book Deep Learning From Scratch, published in September 2019.
Stochastic Gradient Descent (SGD)
Stochastic gradient descent (SGD) is an iterative method for optimizing an objective function with suitable smoothness properties. SGD performs a parameter update for each training example x and label y: each parameter moves a step of size lr against its gradient.
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # Move each parameter a small step against its gradient
        for key in params.keys():
            params[key] -= self.lr * grads[key]
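As a quick sanity check, update mutates the params dictionary in place. The parameter names below are made up purely for illustration:

import numpy as np

# Hypothetical toy parameters, just to show the in-place update
params = {'W1': np.array([1.0, -2.0]), 'b1': np.array([0.5])}
grads  = {'W1': np.array([0.2, -0.4]), 'b1': np.array([0.1])}

optimizer = SGD(lr=0.01)
optimizer.update(params, grads)

print(params['W1'])  # [ 0.998 -1.996]
print(params['b1'])  # [0.499]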
Momentum
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
import numpy as np

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None  # velocity, one entry per parameter

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            # Accumulate a decaying sum of past gradients, then step along it
            self.v[key] = self.momentum*self.v[key] - self.lr*grads[key]
            params[key] += self.v[key]
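One way to see the acceleration: with a constant gradient, the velocity v keeps growing until the effective step size approaches lr / (1 - momentum). A minimal sketch, using a hypothetical scalar parameter 'x':

# With a constant gradient of 1.0, the velocity approaches -lr/(1-momentum) = -1.0
params = {'x': np.array([0.0])}
grads  = {'x': np.array([1.0])}

optimizer = Momentum(lr=0.1, momentum=0.9)
for step in range(1, 4):
    optimizer.update(params, grads)
    print(step, optimizer.v['x'])  # [-0.1], [-0.19], [-0.271]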
AdaGrad
Adagrad is an algorithm for gradient-based optimization. It adapts the learning rate to the parameters, performing smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequent features.
class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None  # running sum of squared gradients

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            # 1e-7 avoids division by zero while h is still (near) zero
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
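Because h accumulates squared gradients, the effective step size lr / sqrt(h) only shrinks over time, so parameters that keep receiving large gradients get progressively smaller updates. A small sketch under the same assumptions as above (hypothetical scalar parameter, constant gradient):

# Each update divides by the growing sqrt(h), so the step size shrinks
params = {'x': np.array([0.0])}
grads  = {'x': np.array([2.0])}

optimizer = AdaGrad(lr=1.0)
for step in range(1, 4):
    x_before = params['x'].copy()
    optimizer.update(params, grads)
    print(step, x_before - params['x'])  # step sizes: ~1.0, ~0.707, ~0.577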
Adam
Adaptive Moment Estimation (Adam) is a method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
class Adam:
    """Adam (http://arxiv.org/abs/1412.6980v8)"""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None  # first moment: moving average of gradients
        self.v = None  # second moment: moving average of squared gradients

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1
        # Bias correction is folded into the step size lr_t
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for key in params.keys():
            # Equivalent to m = beta1*m + (1-beta1)*grad
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            # Equivalent to v = beta2*v + (1-beta2)*grad**2
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])

            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
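Note that this implementation folds the bias correction into the step size lr_t instead of dividing m and v by (1 - beta1**t) and (1 - beta2**t) explicitly. A quick check with a single scalar gradient (the numbers are arbitrary) shows the two forms agree, up to where the 1e-7 term is added:

g, lr, beta1, beta2, eps = 0.5, 0.001, 0.9, 0.999, 1e-7

# Textbook Adam, one step from zero-initialized moments
m = (1 - beta1) * g
v = (1 - beta2) * g**2
m_hat = m / (1 - beta1**1)
v_hat = v / (1 - beta2**1)
print(lr * m_hat / (np.sqrt(v_hat) + eps))  # ~0.001

# Folded form used by the class above
lr_t = lr * np.sqrt(1 - beta2**1) / (1 - beta1**1)
print(lr_t * m / (np.sqrt(v) + eps))        # ~0.001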
Comparison of Optimizers
Let’s compare these four optimizers on the function f(x, y) = x²/20 + y², whose minimum is at the origin.
import numpy as np
import matplotlib.pyplot as plt
from collections import OrderedDict

def f(x, y):
    return x**2 / 20.0 + y**2

def df(x, y):
    return x / 10.0, 2.0*y

init_pos = (-7.0, 2.0)
params = {}
params['x'], params['y'] = init_pos[0], init_pos[1]
grads = {}
grads['x'], grads['y'] = 0, 0

optimizers = OrderedDict()
optimizers["SGD"] = SGD(lr=0.95)
optimizers["Momentum"] = Momentum(lr=0.1)
optimizers["AdaGrad"] = AdaGrad(lr=1.5)
optimizers["Adam"] = Adam(lr=0.3)

idx = 1

# Figure size
fig = plt.figure()
fig.set_size_inches(10, 10)

for key in optimizers:
    optimizer = optimizers[key]
    x_history = []
    y_history = []
    # Reset to the initial position (-7.0, 2.0) for each optimizer
    params['x'], params['y'] = init_pos[0], init_pos[1]

    for i in range(30):  # 30 update steps
        x_history.append(params['x'])
        y_history.append(params['y'])
        # Gradient at the current position
        grads['x'], grads['y'] = df(params['x'], params['y'])
        optimizer.update(params, grads)

    # Grid for the contour plot
    x = np.arange(-10, 10, 0.01)
    y = np.arange(-5, 5, 0.01)
    X, Y = np.meshgrid(x, y)
    Z = f(X, Y)

    # Mask large values so the contours near the minimum stay visible
    mask = Z > 7
    Z[mask] = 0

    # Plot the optimization path over the contours
    plt.subplot(1, 4, idx)
    idx += 1
    plt.plot(x_history, y_history, 'o-', color="orange")
    plt.contour(X, Y, Z)
    plt.title(key)

plt.show()
This is the output of the code above. You can see how each optimizer moves toward the global minimum at the origin along a different path.