I've been trying to get into Python and have been using some online courses (I'm working in Jupyter Notebooks with Python 3, if that matters). One of them covers statsmodels and regressions. As far as my statistics courses have taught me, you generally want to include an intercept (I'm sure there are reasons not to, but as far as I know that's the exception).
1) I tried asking Google and stumbled across an example I don't quite get. This one is from the statsmodels site:
import statsmodels.api as sm
Y = [1,3,4,5,2,3,4]
X = range(1,8)
X = sm.add_constant(X)
model = sm.OLS(Y,X)
results = model.fit()
results.params
I get what they're doing here. However, just to try some things out, I thought I'd leave out the intercept:
import statsmodels.api as sm
Y = [1,3,4,5,2,3,4]
X = range(1,8)
model = sm.OLS(Y,X)
results = model.fit()
results.params
Question 1: This raises an error:

ValueError Traceback (most recent call last)
<ipython-input-3-c8dfe3eb8b44> in <module>

It points at the line model = sm.OLS(Y,X) - why?
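As a sanity check, I redid both fits with plain numpy (my own attempt, not from the course) - both versions run fine there, which is part of why the ValueError confuses me:

```python
# My own check with plain numpy: fit the same data with and
# without an intercept column, using least squares directly.
import numpy as np

Y = np.array([1, 3, 4, 5, 2, 3, 4], dtype=float)
x = np.arange(1, 8, dtype=float)

# With intercept: design matrix is a column of ones plus x
X_with = np.column_stack([np.ones_like(x), x])
beta_with, *_ = np.linalg.lstsq(X_with, Y, rcond=None)

# Without intercept: the design matrix is just x as one column,
# which forces the fitted line through the origin
X_without = x.reshape(-1, 1)
beta_without, *_ = np.linalg.lstsq(X_without, Y, rcond=None)

print(beta_with)     # [intercept, slope]
print(beta_without)  # [slope] only
```

So numpy happily fits either design matrix; I just can't tell why statsmodels objects in the second case.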
2a) Here's the code as it appeared in the course. It's about predicting the price of a car from several variables (mileage, cylinders, doors):
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
%matplotlib inline
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
X = df[['Mileage', 'Cylinder', 'Doors']]
y = df['Price']
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)
print(X)
est = sm.OLS(y, X).fit()
est.summary()
Question 2: This seems to work, but it also produces a warning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead" - What does that mean? Is it just a heads-up from pandas warning about potentially problematic syntax, as this discussion seems to suggest?
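From what I can tell, the warning goes away if I build X as an explicit copy and assign with .loc - here's a minimal reproduction I made with made-up numbers instead of the course's cars.xls, so no download is needed (my own workaround, I'm not sure it's the canonical fix):

```python
# Minimal stand-in for the cars data (invented values, just to
# demonstrate the assignment pattern without the Excel download).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Mileage': [8221.0, 9135.0, 13196.0],
    'Cylinder': [6.0, 6.0, 4.0],
    'Doors': [4.0, 2.0, 4.0],
    'Price': [17314.0, 17542.0, 16714.0],
})

# df[['Mileage', 'Cylinder', 'Doors']] is a slice that may be a view
# of df; assigning into that slice is what triggers the warning.
# Taking an explicit copy makes the assignment unambiguous:
cols = ['Mileage', 'Cylinder', 'Doors']
X = df[cols].copy()

# Same standardization StandardScaler performs: subtract the column
# mean, divide by the population standard deviation (ddof=0)
X.loc[:, cols] = (X[cols] - X[cols].mean()) / X[cols].std(ddof=0)

print(X)
```

With the .copy() plus .loc assignment, I no longer see the warning on my machine.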
2b) The same code, with an intercept added:
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
%matplotlib inline
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
X = df[['Mileage', 'Cylinder', 'Doors']]
y = df['Price']
X = sm.tools.tools.add_constant(X)
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)
print(X)
est = sm.OLS(y, X).fit()
est.summary()
Question 3: The coefficients don't change compared to the model without the added constant - what am I doing wrong? Also, when executing print(X), the constant is listed as 1 for every observation - is that because it's basically a placeholder at that point? But why wouldn't it be 0?
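While poking at this, I tried to convince myself with plain numpy why the column is ones rather than zeros (my own reasoning, so it may be off): the intercept estimate is the coefficient that multiplies that column, and anything times zero is zero.

```python
# Why a column of ones and not zeros? The fitted coefficient for
# that column IS the intercept, since coefficient * 1 = coefficient.
import numpy as np

x = np.arange(1, 8, dtype=float)
y = np.array([1, 3, 4, 5, 2, 3, 4], dtype=float)

# What add_constant effectively does: prepend a column of ones
X_ones = np.column_stack([np.ones_like(x), x])
b_ones, *_ = np.linalg.lstsq(X_ones, y, rcond=None)

# With a column of zeros instead, that column contributes nothing:
# whatever coefficient multiplies it, the product is always zero,
# so no intercept can be expressed at all.
X_zeros = np.column_stack([np.zeros_like(x), x])
b_zeros, *_ = np.linalg.lstsq(X_zeros, y, rcond=None)

print(b_ones)   # intercept and slope both recovered
print(b_zeros)  # slope has to absorb everything on its own
```

That at least explains the ones to me; it doesn't explain why my coefficients in 2b don't change, which is the part I'm stuck on.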
Question 4: And to stay on the topic of what I'm not understanding: when standardization is applied with scale.fit_transform, does it matter whether the constant is added before or after?
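My hunch, tested with numpy below, is that it would matter if the constant column itself got standardized - a column of ones has zero variance, so centering alone already turns it into all zeros. But in my 2b code I only pass ['Mileage', 'Cylinder', 'Doors'] to fit_transform, so the const column is untouched either way. Does that reasoning hold?

```python
# What would happen if the constant column itself were standardized?
import numpy as np

const = np.ones(7)
mean, std = const.mean(), const.std()
print(mean, std)  # 1.0 0.0 - zero variance

# (x - mean) / std would divide by zero here; even just centering
# wipes the column out, so the intercept term would be destroyed.
centered = const - mean
print(centered)  # all zeros before any division happens
```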
If someone could help me with any of these questions, I'd really appreciate it.