Why I'm getting a huge cost in Stochastic Gradient Descent Implementation?

Question

I've run into some problems while trying to implement Stochastic Gradient Descent, and basically what is happening is that my cost is growing like crazy and I don't have a clue why.

MSE implementation:

def mse(x,y,w,b):
    predictions = x @ w 
    summed = (np.square(y - predictions - b)).mean(0)
    cost = summed / 2 
    return cost

Gradients:

def grad_w(y,x,w,b,n_samples):
    return -y @ x / n_samples + x.T @ x @ w / n_samples + b * x.mean(0)
def grad_b(y,x,w,b,n_samples):
    return -y.mean(0) + x.mean(0) @ w + b

SGD Implementation:

def stochastic_gradient_descent(X,y,w,b,learning_rate=0.01,iterations=500,batch_size =100):
    
    length = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(length/batch_size)
    
    for it in range(iterations):
        cost =0
        indices = np.random.permutation(length)
        X = X[indices]
        y = y[indices]
        for i in range(0,length,batch_size):
            X_i = X[i:i+batch_size]
            y_i = y[i:i+batch_size]

            w -= learning_rate*grad_w(y_i,X_i,w,b,length)
            b -= learning_rate*grad_b(y_i,X_i,w,b,length)
            
            cost = mse(X_i,y_i,w,b)
        cost_history[it]  = cost
        if cost_history[it] <= 0.0052: break
        
    return w, cost_history[:it]

Random Variables:

w_true = np.array([0.2, 0.5,-0.2])
b_true = -1
first_feature = np.random.normal(0,1,1000)
second_feature = np.random.uniform(size=1000)
third_feature = np.random.normal(1,2,1000)
arrays = [first_feature,second_feature,third_feature]
x = np.stack(arrays,axis=1) 
y = x @ w_true + b_true + np.random.normal(0,0.1,1000)
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 1.0

After running this:

theta,cost_history = stochastic_gradient_descent(x,y,w,b)

print('Final cost/MSE:  {:0.3f}'.format(cost_history[-1]))

I Get that:

Final cost/MSE:  3005958172614261248.000

And here is the plot

TQCH · Answer 1 · 2020-11-12T00:36:21.203

Here are a few suggestions:

your learning rate is too big for the training: changing it to something like 1e-3 should be fine.
your update part could be slightly modified as follows:

def stochastic_gradient_descent(X,y,w,b,learning_rate=0.01,iterations=500,batch_size =100):
    
    length = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(length/batch_size)
    
    for it in range(iterations):
        cost =0
        indices = np.random.permutation(length)
        X = X[indices]
        y = y[indices]
        for i in range(0,length,batch_size):
            X_i = X[i:i+batch_size]
            y_i = y[i:i+batch_size]

            w -= learning_rate*grad_w(y_i,X_i,w,b,len(X_i)) # the denominator should be the actual batch size
            b -= learning_rate*grad_b(y_i,X_i,w,b,len(X_i))
            
            cost += mse(X_i,y_i,w,b)*len(X_i) # add batch loss
        cost_history[it]  = cost/length # this is a running average of your batch losses, which is statistically more stable
        if cost_history[it] <= 0.0052: break
        
    return w, b, cost_history[:it]

The final results:

w_true = np.array([0.2, 0.5, -0.2])
b_true = -1
first_feature = np.random.normal(0,1,1000)
second_feature = np.random.uniform(size=1000)
third_feature = np.random.normal(1,2,1000)
arrays = [first_feature,second_feature,third_feature]
x = np.stack(arrays,axis=1) 
y = x @ w_true + b_true + np.random.normal(0,0.1,1000)
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 0.0
theta,bias,cost_history = stochastic_gradient_descent(x,y,w,b,learning_rate=1e-3,iterations=3000)

print("Final epoch cost/MSE:  {:0.3f}".format(cost_history[-1]))
print("True final cost/MSE: {:0.3f}".format(mse(x,y,theta,bias)))
print(f"Final coefficients:\n{theta,bias}")

score 0 · Answer 2 · answered Nov 23 '20 at 21:28

Hey @TQCH and thanks for that. I've come up with a different approach to implement SGD without an inner loop and the results were also pretty sweet.

def stochastic_gradient_descent(X,y,w,b,learning_rate=0.35,iterations=3000,batch_size =100):
    
    length = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(length/batch_size)
    marker = 0
    cost = mse(X,y,w,b)
    print(cost)
    for it in range(iterations):
        cost =0
        indices = np.random.choice(length, batch_size)
        X_i = X[indices]
        y_i = y[indices]

        w -= learning_rate*grad_w(y_i,X_i,w,b)
        b -= learning_rate*grad_b(y_i,X_i,w,b)
            
        cost = mse(X_i,y_i,w,b)
        cost_history[it]  = cost
        if cost_history[it] <= 0.0075 and cost_history[it] > 0.0071: marker = it
        if cost <= 0.0052: break
    print(f'{w}, {b}')
    return w, cost_history, marker, cost

w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 1.0
theta,cost_history, marker, cost = stochastic_gradient_descent(x,y,w,b)

print(f'Number of iterations: {marker}')
print('Final cost/MSE:  {:0.3f}'.format(cost))

which gave me these results:

1.9443112664859845,
[ 0.19592532 0.31735225 -0.20044424], -0.9059800816290591
Number of iterations: 68
Final cost/MSE: 0.005

But you're right I missed that I was dividing by total length of vector y and not by batch size and forgot to add batch loss!

Thanks for that!

Why I'm getting a huge cost in Stochastic Gradient Descent Implementation?

2 Answers2