
I am fitting a regression model using:

# imports used by the snippets below; dataset2 is a pandas DataFrame loaded earlier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

X = dataset2.iloc[:, 0:-1]
y = dataset2.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Then, I extract the coefficients with:

coefficients = regressor.coef_

However, since I also need the standard errors (i.e., the variance-covariance matrix of the coefficients), I am computing the OLS estimate manually:

features = dataset2.iloc[:, 0:-1]

# N = number of observations
# k = number of independent regressors
N = len(X_train)
k = len(features.columns) + 1  # plus one because LinearRegression adds an intercept term

X_with_intercept = np.empty(shape=(N, k), dtype=float)
X_with_intercept[:, 0] = 1
X_with_intercept[:, 1:k] = X_train

# b = (X'X)^-1 X'y
# @ is the matrix multiplication operator
beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y_train
print(beta_hat)

which returns:

[ 0.    0.   -0.01 -0.    0.    0.   -0.   -0.   -0.   -0.   -0.  ]

On the other hand, `coefficients` returns:

[0.0021308430119209416, -0.006294407027962639, -0.0021887043694901707, 0.004512777544097981, 0.000550417874231508, -0.0003297844194107745, -0.0019042607512515818, -0.0011443799090231155, -0.0012652793840597606, -0.0017634228809034023]
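
Side note for the comparison: in scikit-learn, `regressor.intercept_` holds the constant term separately, so `beta_hat[0]` should match `regressor.intercept_` and `beta_hat[1:]` should match `regressor.coef_`. A quick sketch of that check, using the names defined above:

# compare the manual OLS solution against scikit-learn's fitted parameters
print(np.allclose(beta_hat[0], regressor.intercept_))   # intercept term
print(np.allclose(beta_hat[1:], regressor.coef_))       # feature coefficients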

I'd like to increase the number of decimal places so I can compare the two methods properly. I tried `round(beta_hat, 6)`, but it didn't do the trick...

Source code of manual computation: Python scikit learn Linear Model Parameter Standard Error
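
For completeness, the standard-error step I intend to run once `beta_hat` is verified, a minimal sketch based on the usual OLS covariance formula Var(b) = sigma^2 (X'X)^-1 (the variable names below are my own):

# standard errors from the usual OLS covariance estimator
y_hat = X_with_intercept @ beta_hat                      # fitted values
residuals = y_train - y_hat                              # regression residuals
sigma_squared_hat = (residuals @ residuals) / (N - k)    # unbiased residual variance
var_beta_hat = sigma_squared_hat * np.linalg.inv(X_with_intercept.T @ X_with_intercept)
standard_errors = np.sqrt(np.diag(var_beta_hat))         # one standard error per coefficient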

  • In your link, the answer uses `y.values` instead of just `y` in the `beta_hat` assignment. Is that intentional? (You haven't posted what type of object `y_train` is.) – pu239 Aug 14 '21 at 19:37
  • My `y_train` is already a NumPy object, so `y_train.values` was not needed. I've updated the question to show where it comes from. – Joehat Aug 14 '21 at 19:46
  • What do you get for `X.T@X` and its inverse? – Him Aug 14 '21 at 19:56
  • 1
    You're including y in your x feature set. Note that there are 11 coefs when you do your calculations manually. – Him Aug 14 '21 at 19:57
  • For the inverse of `X.T@X` I get: array([[ 1.68e+00, -2.12e-02, -1.31e-01, -4.78e-02, -3.84e-01, -7.22e-01, ... ]]) – Joehat Aug 14 '21 at 19:59
  • I exclude the label by dropping the last column of the feature set: `features = dataset2.iloc[:, 0:-1]`. – Joehat Aug 14 '21 at 20:00
  • The count of 11 is due to the intercept term. – Joehat Aug 14 '21 at 20:01

1 Answer


Answering my own question: I just found out that `list(beta_hat)` prints more decimal places. Converting to a list makes each element print with the full float repr precision, whereas NumPy truncates array output to its display precision by default.
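
For example (the `np.set_printoptions` call is a standard NumPy alternative I haven't strictly needed here, shown as a sketch):

print(list(beta_hat))   # each element prints with full float repr precision

# Alternative: raise NumPy's display precision globally;
# suppress=True avoids scientific notation for small values
np.set_printoptions(precision=12, suppress=True)
print(beta_hat)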
