
I am trying to run a regression model with two different functions: `OLS` from `statsmodels.api` and `LinearRegression` from sklearn, but the outputs are quite different from each other.

Here is my code:

import numpy as np  # needed for np.log / np.power in the formula
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from patsy import dmatrices
from sklearn import linear_model

data = pd.read_excel("2001_SCF_Pivot.xlsx")
y, x = dmatrices("np.log(RETQLIQ) ~ W_P_ADJ + np.power(W_P_ADJ,2) + np.power(W_P_ADJ,3) + INCOME + np.power(INCOME,2) + WHITE + AGE + EDUC + FEMALE + SINGLE", data=data)

lr = linear_model.LinearRegression()  # avoid shadowing the class name
ols = lr.fit(x, y)
sklearn_prediction = ols.predict(x)

model_fit = sm.OLS(y, x)
results = model_fit.fit()
sm_prediction = results.predict(x)

When I scatter the data and add both predictions to the graph, the two curves should in theory lie on top of each other, but the predictions of the two functions are quite different, as you can see in the attached image. My question is: why do I get different results, and what is the right way to do this? Thanks a lot in advance!

You can find the related graph here: https://i.stack.imgur.com/WKJqQ.jpg

newbiee
    A very good breakdown between the two can be found on this [stats.se answer](https://stats.stackexchange.com/a/146809) and [this answer](https://stackoverflow.com/questions/22054964/ols-regression-scikit-vs-statsmodels), as to why the models might be different as well as the interfaces. One additional comment I'll make is that you don't set your random seed, so even between consecutive runs of the same package, you'll likely see differences. – G. Anderson May 21 '19 at 18:32
  • The thing is, the link you referred to mostly focuses on training and testing and how the two functions approach them differently, but in this simple exercise there are no train and test data sets. The only thing the two functions need to do is apply the minimization procedure, get the coefficients, and apply them to the raw data, yet they clearly do something different from each other, and I wonder why exactly. – newbiee May 21 '19 at 18:48
  • The second answer I linked goes into specifics on how the two differ without tuning and how to make the two converge more closely. Was that helpful at all for your situation? – G. Anderson May 21 '19 at 18:52
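One point of context for the intercept discussion: patsy's `dmatrices` adds an `Intercept` column of ones by default, so the `x` produced by a formula already carries a constant. This can be checked on a toy frame (a small sketch; the column names `y` and `a` are illustrative):

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({"y": [1.0, 2.0, 3.0], "a": [0.5, 1.5, 2.5]})
y, x = dmatrices("y ~ a", data=df)

# patsy includes an intercept column of ones by default
print(x.design_info.column_names)  # ['Intercept', 'a']
print(np.asarray(x)[:, 0])         # [1. 1. 1.]
```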

1 Answer


I had a similar problem with OLS until I saw this in the statsmodels documentation:

> No constant is added by the model unless you are using formulas.

I looked at the summary, and sure enough there was no constant.

I fixed that using a new variable:

x_ols = sm.add_constant(x_my_old_data)

Then I used OLS with that variable:

linear_sm = sm.OLS(y_my_old_data, x_ols).fit()

To get predictions, I had to pass the same x_ols:

y_pred = linear_sm.predict(x_ols)

And to plot it, I used the original x_my_old_data:

plt.plot(x_my_old_data, y_my_old_data)

statsmodels.formula.api includes the constant automatically, so with it you don't need any of these workarounds.

Miguel