
I want to implement gradient descent with NumPy for linear regression, but I get an error in this code:

import numpy as np

# Code Example
rng = np.random.RandomState(10)
X = 10*rng.rand(1000, 5) # feature matrix
y = 0.9 + np.dot(X, [2.2, 4, -4, 1, 2]) # target vector

# GD implementation for linear regression
def GD(X, y, eta=0.1, n_iter=20):
    theta = np.zeros((X.shape[0], X.shape[1]))
    for i in range(n_iter):
        grad = 2 * np.mean((np.dot(theta.T, X) - y) * X)
        theta = theta - eta * grad
    return theta

# SGD implementation for linear regression
def SGD(X, y, eta=0.1, n_iter=20):
    theta = np.zeros(1, X.shape[1])
    for i in range(n_iter):
        for j in range(X.shape[0]):
            grad = 2 * np.mean((np.dot(theta.T, X[j,:]) - y[j]) * X[j,:])
            theta = theta - eta * grad
    return theta

# MSE loss for linear regression with numpy
def MSE(X, y, theta):
    return np.mean((X.dot(theta.T) - y)**2)

# linear regression with GD and MSE with numpy
theta_gd = GD(X, y)
theta_sgd = SGD(X, y)

print('MSE with GD: ', MSE(X, y, theta_gd))
print('MSE with SGD: ', MSE(X, y, theta_sgd))

The error is

grad = 2 * np.mean((np.dot(theta.T, X) - y) * X)
ValueError: operands could not be broadcast together with shapes (5,5) (1000,)

and I can't solve it.
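
For reference, checking the shapes of the two operands in the failing line confirms the mismatch:

theta = np.zeros((X.shape[0], X.shape[1]))  # theta.shape == (1000, 5)
print(np.dot(theta.T, X).shape)             # (5, 5)
print(y.shape)                              # (1000,)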

  • Welcome to Stack Overflow. `np.dot(theta.T, X).shape` is (5,5), but `y.shape` is (1000,). They can't be [broadcast together](https://numpy.org/doc/stable/user/basics.broadcasting.html) to do the subtraction (because of their shapes). To solve this problem you have to understand what you are trying to do with these operations. – Vladimir Fokow Aug 13 '22 at 10:03
  • Thank you for your answer. I understand what you mean, but my problem is with gradient descent for linear regression itself; my question is not about the code but about the math and the machine learning. – Novinnam Aug 13 '22 at 10:14
  • note that in your code `np.zeros((X.shape[0], X.shape[1]))` is equivalent to just `np.zeros(X.shape)` – Vladimir Fokow Aug 13 '22 at 16:51

2 Answers


Minor changes to your code that resolve the dimensionality issues during matrix multiplication make it run successfully. In particular, note that a linear regression on a design matrix X of dimension N×k has a parameter vector theta of size k.

In addition, I'd suggest some changes to SGD() that make it a proper stochastic gradient descent: evaluate the gradient over random subsets of the data, obtained by randomly shuffling the index set of the training data with np.random.shuffle() and looping through it in batches. The batch_size argument determines the size of each subset after which the parameter estimate is updated, and the seed argument ensures reproducibility.

# GD implementation for linear regression
def GD(X, y, eta=0.001, n_iter=100):
    theta = np.zeros(X.shape[1])
    for i in range(n_iter):
        for j in range(X.shape[0]):
            grad = (2 * np.mean(X[j,:] @ theta - y[j]) * X[j,:])  # changed line
            theta -= eta * grad
    return theta

# SGD implementation for linear regression
def SGD(X, y, eta=0.001, n_iter=1000, batch_size=25, seed=7678):
    theta = np.zeros(X.shape[1])
    indexSet = list(range(len(X)))
    np.random.seed(seed)
    for i in range(n_iter):
        np.random.shuffle(indexSet) # random shuffle of index set
        for j in range(round(len(X) / batch_size)+1):
            X_sub = X[indexSet[j*batch_size:(j+1)*batch_size],:]
            y_sub = y[indexSet[j*batch_size:(j+1)*batch_size]]
            if len(X_sub) > 0:
                grad = (2 * np.mean(X_sub @ theta - y_sub) * X_sub)  # changed line
                theta -= eta * np.mean(grad, axis=0)
    return theta

Running the code, I get

print('MSE with GD : ',  MSE(X, y, theta_gd))
print('MSE with SGD: ', MSE(X, y, theta_sgd))
> MSE with GD :  0.07602
  MSE with SGD:  0.05762
7shoe

Each observation has 5 features, and X contains 1000 observations:

X = rng.rand(1000, 5) * 10  # X.shape == (1000, 5)

Create y, which is a perfect linear function of X (no noise added):

real_weights = np.array([2.2, 4, -4, 1, 2]).reshape(-1, 1)
real_bias = 0.9
y = X @ real_weights + real_bias  # y.shape == (1000, 1)

G.D. implementation for linear regression:

Note: w (weights) is your theta variable. I have also added the calculation of b (bias).

def GD(X, y, eta=0.1, n_iter=20):
    # Initialize weights and a bias (all zeros):
    w = np.zeros((X.shape[1], 1))  # w.shape == (5, 1)
    b = 0
    # Gradient descent
    for i in range(n_iter):
        errors = X @ w + b - y  # errors.shape == (1000, 1)
        dw = 2 * np.mean(errors * X, axis=0).reshape(5, 1)
        db = 2 * np.mean(errors)
        w -= eta * dw
        b -= eta * db
    return w, b

Testing:

w, b = GD(X, y, eta=0.003, n_iter=5000)
print(w, b)
[[ 2.20464905]
 [ 4.00510139]
 [-3.99569374]
 [ 1.00444026]
 [ 2.00407476]] 0.7805448262466914

Notes:

  • Your function SGD also contains some errors (see the sketch after these notes).
  • I'm using the @ operator because it's just my preference over np.dot.
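
For reference, a per-sample SGD in the same style (weights w plus a bias b) could look roughly like the sketch below, assuming X and y as defined in this answer (y has shape (1000, 1)); the defaults for eta, n_iter, and seed are arbitrary and not tuned:

def SGD(X, y, eta=0.001, n_iter=50, seed=0):
    # Rough sketch of per-sample stochastic gradient descent.
    rng = np.random.RandomState(seed)
    w = np.zeros((X.shape[1], 1))  # w.shape == (5, 1)
    b = 0.0
    for i in range(n_iter):
        for j in rng.permutation(X.shape[0]):  # visit the samples in random order
            x_j = X[j:j+1, :]                  # keep the 2-D shape (1, 5)
            error = x_j @ w + b - y[j]         # (1, 1) residual for this sample
            dw = 2 * error * x_j.T             # gradient w.r.t. w, shape (5, 1)
            db = 2 * error.item()              # gradient w.r.t. b, a scalar
            w -= eta * dw
            b -= eta * db
    return w, b

Calling w, b = SGD(X, y) should then give estimates close to the true weights and bias, just like GD above, since y is noise-free.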
Vladimir Fokow