
For linear regression with one variable and an intercept, I can compute the R² as follows:

R^2 = (np.sum(((x - np.mean(x)) / np.std(x, ddof=1)) * ((y - np.mean(y)) / np.std(y, ddof=1))) / (len(x) - 1)) ** 2

How do I compute R² for linear regression with one variable and without an intercept, without having to use statsmodels.api OLS, linregress, or any other third-party package? Is my understanding correct that np.mean(y) is effectively treated as 0 for linear regression without an intercept?

What is the fastest way in NumPy to get the R² for linear regression with one variable and no intercept?

godimedia

3 Answers


For ordinary least squares (OLS), we can concisely solve a univariate linear regression using the normal-equation formula, w = (XᵀX)⁻¹ Xᵀ y.

We can implement this in NumPy using column vectors for X and y:

import numpy as np
X = np.array([[0, 1, 2, 3, 4, 5]]).T
y = np.array([[1, 0, 3, 2, 5, 4]]).T

w = np.linalg.inv(X.T @ X) @ X.T @ y

y_hat = X @ w

Above, I use the @ operator for matrix multiplication. y_hat stores the linear regression's y-coordinates for each X-coordinate. One can then compute R² using the sum-of-squares formula:

SS_tot = (y - np.mean(y)).T @ (y - np.mean(y))   # total sum of squares (centered)
SS_res = (y - y_hat).T @ (y - y_hat)             # residual sum of squares
r_squared = 1 - (SS_res / SS_tot)

Above, I'm using vector inner products to write the sum of squares more concisely, as summing the squares of a vector v is the same as transposing v and multiplying it by itself.

This computes R² for your regression, with the slope stored in w; because X and y are column vectors, both r_squared and w come out as 1x1 NumPy arrays.
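
If numerical stability matters, a minimal sketch of the same fit using np.linalg.lstsq (which solves the least-squares problem via SVD rather than forming an explicit inverse) could look like this:

import numpy as np

X = np.array([[0, 1, 2, 3, 4, 5]]).T
y = np.array([[1, 0, 3, 2, 5, 4]]).T

# Solve min ||X w - y||^2 directly; avoids explicitly inverting X.T @ X
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w

SS_tot = (y - np.mean(y)).T @ (y - np.mean(y))
SS_res = (y - y_hat).T @ (y - y_hat)
r_squared = 1 - (SS_res / SS_tot)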

dsillman2000
  • Based on your example, if you try to match the R² with OLS, `sm.OLS(y, X).fit().rsquared`, it doesn't match. Am I missing something here? – godimedia Jul 03 '23 at 17:23
  • Hi @godimedia, yes, I messed up the R² formula; I was cutting corners in my formula for variance and wrongly used it to express R². Apologies! – dsillman2000 Jul 03 '23 at 17:32
  • @godimedia After some digging, I got down to the reason why statsmodels disagrees with the (corrected) formula above: https://stackoverflow.com/questions/70179307/why-is-sklearn-r-squared-different-from-that-of-statsmodels-when-fit-intercept-f/70180217#70180217 You can use the un-centered sum of squares in the denominator by removing the `np.mean` terms from the SS_tot expression. – dsillman2000 Jul 03 '23 at 21:06
  • Thank you. Yes, I agree, it's the un-centered sum; hence my remark in the question about `np.mean(y) = 0`. – godimedia Jul 04 '23 at 20:18
  • Thank you so much for your response and answer. I've accepted Onyambu's answer since it gets me to what I needed and it works out as the fastest way to get R² and beta. – godimedia Jul 05 '23 at 15:21

In the case of one variable with no intercept, you could easily do:

sum(x*y)**2/sum(x*x)/sum(y*y)

In matrix notation this can be written as

 (y @ x)**2/(x @ x * y @ y)

For example:

import statsmodels.api as sm
x, y = sm.datasets.get_rdataset('iris').data.iloc[:,:2].values.T
print(sm.OLS(y,x).fit().rsquared)
0.9565098243072627

print((y @ x)**2/(x @ x * y @ y))
0.9565098243072627

Note that the two results are equivalent.
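
To verify the closed form without statsmodels, here is a minimal pure-NumPy sketch (the synthetic x and y below are placeholders, not the iris columns): fit the no-intercept slope as sum(x*y)/sum(x*x), plug the fitted values into the uncentered R² definition, and it reproduces the same number.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)                  # any 1-D predictor
y = 2.0 * x + rng.normal(size=100)        # any 1-D response

beta = (x @ y) / (x @ x)                  # no-intercept OLS slope
y_hat = beta * x

# Uncentered R^2 (the convention used when there is no constant term)
r2_definition = 1 - ((y - y_hat) @ (y - y_hat)) / (y @ y)
r2_closed_form = (x @ y) ** 2 / ((x @ x) * (y @ y))

print(np.isclose(r2_definition, r2_closed_form))   # True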


You could extend the above to include multiple variables:

import statsmodels.api as sm, numpy as np

dat = sm.datasets.get_rdataset('iris').data
x = dat.iloc[:,1:4].values
y = dat.iloc[:,0].values

print(sm.OLS(y, x).fit().rsquared)
0.9961972754365206

print(y @ x @ np.linalg.solve(x.T @ x, x.T @ y)  / (y @ y))
0.9961972754365208
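
If you need the coefficients as well as R² (for example, when running many such regressions), a small helper along these lines returns both; the function name and random data here are only illustrative:

import numpy as np

def no_intercept_fit(x, y):
    """Return (beta, r_squared) for OLS through the origin.

    x is an (n, p) design matrix and y an (n,) response.
    """
    beta = np.linalg.solve(x.T @ x, x.T @ y)   # normal equations, no explicit inverse
    r2 = (y @ x @ beta) / (y @ y)              # uncentered R^2 = ||y_hat||^2 / ||y||^2
    return beta, r2

# Example with synthetic data
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
beta, r2 = no_intercept_fit(x, y)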
Onyambu
  • Kudos for the concise notation. Can you provide an explanation for how the expression is derived from first principles? – dsillman2000 Jul 03 '23 at 21:11
  • @dsillman2000 I just used `SSR/SST` with no intercept, where SST = ||y||^2 = y@y. Without an intercept, SSR = ||ŷ||^2. You can easily show that SSR is given by the provided code. – Onyambu Jul 03 '23 at 21:19
  • @onyambu Thanks a ton, that makes sense. Do you think `np.linalg.inv(X.T @ X) @ X.T @ y` will give me the beta for the same? This is extremely fast and works for me, as I'm looking to perform millions of those regressions and just need beta and R². – godimedia Jul 05 '23 at 14:17
  • @godimedia That's how beta is calculated. You can always compare to `sm.OLS`. Also, depending on your data size, you could use SVD/QR to compute beta. – Onyambu Jul 05 '23 at 15:15
  • @Onyambu Yeah, I think this is faster: `np.linalg.solve(x.reshape(-1,1).T @ x.reshape(-1,1), x.reshape(-1,1).T @ y)[0]` as compared to `(np.linalg.inv(x.reshape(-1,1).T @ x.reshape(-1,1)) @ x.reshape(-1,1).T @ y)[0]`. Thanks a lot, once again. I see that you have a buy-me-a-coffee page, will do so soon. – godimedia Jul 05 '23 at 15:19
  • @godimedia If you have 1 variable, I would suggest comparing that to `(y @ x)**2/(x @ x * y @ y)` or `(y*x).sum()**2/(x*x).sum()/(y*y).sum()` and checking which one is faster. – Onyambu Jul 05 '23 at 15:22
  • @Onyambu I did check; the former one is faster since it's raw numpy arrays in my case. – godimedia Jul 05 '23 at 15:27
  • @Onyambu For betas, is this the right way to do it? `np.linalg.solve(x.reshape(-1,1).T @ x.reshape(-1,1), x.reshape(-1,1).T @ y)[0]` What's the SVD/QR way to compute beta? – godimedia Jul 05 '23 at 15:29
  • @godimedia I just looked into the `solve` function of numpy and found out it does an LU factorization, hence no need for the SVD/QR factorization. Note that the aim of `solve` is to avoid using `inverse` to invert the matrix. Also, this will be much faster if you had a wide data set, i.e. many variables. – Onyambu Jul 05 '23 at 15:36
  • Thanks a ton! Yes, I agree. Cheers, man! – godimedia Jul 05 '23 at 15:49

You can compute R² without an intercept as:

R2 = 1 - np.sum((y - (slope * x))**2) / np.sum((y - np.mean(y))**2)
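
Here slope is assumed to be the no-intercept slope, e.g. np.sum(x*y)/np.sum(x*x). Note that the denominator above is the centered total sum of squares, so the result will generally differ from the uncentered R² that statsmodels reports when no constant is included. A minimal sketch with placeholder arrays:

import numpy as np

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([1., 0., 3., 2., 5., 4.])

slope = np.sum(x * y) / np.sum(x * x)   # no-intercept OLS slope
# R^2 with a centered total sum of squares in the denominator
R2 = 1 - np.sum((y - slope * x) ** 2) / np.sum((y - np.mean(y)) ** 2)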
Dejene T.