3

I am trying to replicate the functionality of Statsmodels's weight least squares (WLS) function with Numpy's ordinary least squares (OLS) function (i.e. Numpy refers to OLS as just "least squares").

In other words, I want to compute the WLS in Numpy. I used this Stackoverflow post as reference, but drastically different R² values arise moving from Statsmodel to Numpy.

Take the following example code that replicates this:

import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

# Test Data
patsy_equation = "y ~ C(x) - 1" # Use minus one to get ride of hidden intercept of "+ 1"
weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 3, 3])
d = {"x": x.tolist(), "y": y.tolist()}
data_df = pd.DataFrame(data=d)

# Weighted Least Squares from Statsmodel API
statsmodel_model = smf.wls(formula=patsy_equation, weights=weight, data=data_df)
statsmodel_r2 = statsmodel_model.fit().rsquared      

# Weighted Least Squares from Numpy API
Aw = x.reshape((-1, 1)) * np.sqrt(weight[:, np.newaxis]) # Multiply two column vectors
Bw = y * np.sqrt(weight)
numpy_model, numpy_resid = np.linalg.lstsq(Aw, Bw, rcond=None)[:2]
numpy_r2 = 1 - numpy_resid / (Bw.size * Bw.var())

print("Statsmodels R²: " + str(statsmodel_r2))
print("Numpy R²: " + str(numpy_r2[0]))

After running such code, I get the following results:

Statsmodels R²: 2.220446049250313e-16
Numpy R²: 0.475486515775414

Clearly something is wrong here! Can anyone point out my flaws here? Am I miss understanding the patsy formula?

Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214
Code Doggo
  • 2,146
  • 6
  • 33
  • 58
  • Have you verified that the two methods give the same solution to the weighted least squares problem? If they don't, then there is no point in checking R². – Warren Weckesser May 25 '18 at 16:10
  • 1
    All your `x` values are the same! How do you expect to fit a line to that data? – Warren Weckesser May 25 '18 at 16:17
  • On second thought, apparently you want the line to go through the origin. In that case, a few y values at a single x value gives a well-defined problem. You could compute the slope by taking the weighted average of those y values, and then dividing by x: `np.average(y, weights=weight)/3` gives `0.21441370223978917`, which agrees with `numpy_model`. – Warren Weckesser May 25 '18 at 17:14
  • @WarrenWeckesser How does computing the slope find the R² value? Did I compute an incorrect `numpy_R²` value? And in addition, you said that you divide by `x`, however, you just divided by the *mode* of vector x (i.e. mode = most recurring number). Can you explain a bit more of your calculations? – Code Doggo May 25 '18 at 17:32
  • @WarrenWeckesser Looking at the parameters of the Statsmodels WLS model, I get one parameter of `0.643241...`, whereas I get `0.2144137...` when using Numpy. The models don't agree, so I am guessing I am incorrectly calculating the `numpy_model`? – Code Doggo May 25 '18 at 17:40
  • *"How does computing the slope find the R² value?"* It doesn't, of course. I was just amending my previous comment and question about all the x values being the same. For your question about R², you'll probably need someone more familiar with statsmodels implementation of WLS to give an answer. – Warren Weckesser May 25 '18 at 17:40
  • @WarrenWeckesser Ah okay! I will keep looking into this problem! Thanks anyways! – Code Doggo May 25 '18 at 17:41
  • @WarrenWeckesser I see the problem here. In the `patsy_equation` I have `C(x)`. I am treating my `x` values as categorical and not as numerical variables. If I just put `x` instead of `C(x)`, then I get the same model. Do you have any idea what is happening with the shift from numerical to categorical? More importantly, **how is the math changing?** – Code Doggo May 25 '18 at 18:04
  • (years later) your `Aw` is just a vector, 4 x 1 ?? Also weights are sometimes ~ sigma, sometimes ~ 1/sigma; try weights all 1. – denis Mar 30 '20 at 14:35
  • @denis If I am remembering correctly, that is actually what it ended up being. I had to use `1/weight` in order for this to start working. Also, as stated in my previous comment too, I was using the wrong patsy_equation. I was treating the values as categorical, when I should have been treating them as continuous, numerical values. – Code Doggo Mar 30 '20 at 17:36

0 Answers0