I want to fit a multiple linear regression model with 5 independent variables, 2 of which are categorical.

So I first applied OneHotEncoder to turn the categorical variables into dummies.

These are the dependent and independent variables:

y = df['price']
x = df[['age', 'totalRooms', 'elevator',
        'floorLevel_bottom', 'floorLevel_high', 
        'floorLevel_low',
        'floorLevel_medium','floorLevel_top',
        'buildingType_bungalow', 'buildingType_plate', 
        'buildingType_plate_tower', 'buildingType_tower']]

Next, I tried the following two methods, but found that their results differ only in the intercept and the coefficients of the categorical variables.

from sklearn.linear_model import LinearRegression

mlr = LinearRegression()
mlr.fit(x, y)

print('Intercept: \n', mlr_in.intercept_)
print("Coefficients:")
list(zip(x, mlr_in.coef_))

This gives

Intercept: 35228.96453917408

Coefficients: [('age', 1046.5347118942063), ('totalRooms', -797.7667275033103), ('elevator', 11940.629576736419), ('floorLevel_bottom', 1011.5929167549165), ('floorLevel_high', 157.60625500592502), ('floorLevel_low', 483.89164772666277), ('floorLevel_medium', 630.9547280568961), ('floorLevel_top', -2284.0455475443687), ('buildingType_bungalow', 31610.88176756009), ('buildingType_plate', -9649.087529585862), ('buildingType_plate_tower', -8813.187607409624), ('buildingType_tower', -13148.606630564624)]

import statsmodels.formula.api as smf

x_in = sm.add_constant(x_in)
model = sm.OLS(y, x_in).fit()
print(model.summary())

but this gives


Intercept 2.43e+04
age 1046.5347
totalRooms -797.7667
elevator 1.194e+04
floorLevel_bottom 5870.7604
floorLevel_high 5016.7738
floorLevel_low 5343.0592
floorLevel_medium 5490.1223
floorLevel_top 2575.1220
buildingType_bungalow 3.768e+04
buildingType_plate -3575.1281
buildingType_plate_tower -2739.2282
buildingType_tower -7074.6472

Now I don't understand the difference between them ;(

Jleeca

2 Answers

Did you scale the continuous variables before the analysis? I've had problems with this in the past. Probably not a silver bullet but if you haven't it could be a good start.

Also, it might have something to do with the odd import statements you're using. If you haven't found this Q&A yet, it might help: OLS using statsmodel.formula.api versus statsmodel.api

JubJub

A few things to take care of, assuming you have done the data preprocessing identically for each run. (Judging by the variable names, I think you might have done something else.)

  1. Set the random seed to the same value in both runs, so that any random step starts from the same numbers.
  2. Avoid the dummy variable trap by using pd.get_dummies(x, columns=['floorLevel', 'buildingType'], drop_first=True)
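
To see why the dummy variable trap matters here, below is a small arithmetic check on the numbers printed in the question. It is only a sketch, and it assumes every row has exactly one floorLevel dummy and exactly one buildingType dummy set to 1. Under that assumption, adding a constant to all coefficients of one category and subtracting it from the intercept leaves every prediction unchanged, so the model has many equally good solutions. The names sklearn_coef, statsmodels_coef, and group_shifts are made up for this illustration. Only the coefficients that the statsmodels summary prints to four decimals are compared.

```python
# Coefficients copied from the two outputs in the question.
sklearn_intercept = 35228.96453917408
sklearn_coef = {
    "floorLevel_bottom": 1011.5929167549165,
    "floorLevel_high": 157.60625500592502,
    "floorLevel_low": 483.89164772666277,
    "floorLevel_medium": 630.9547280568961,
    "floorLevel_top": -2284.0455475443687,
    "buildingType_plate": -9649.087529585862,
    "buildingType_plate_tower": -8813.187607409624,
    "buildingType_tower": -13148.606630564624,
}
statsmodels_coef = {
    "floorLevel_bottom": 5870.7604,
    "floorLevel_high": 5016.7738,
    "floorLevel_low": 5343.0592,
    "floorLevel_medium": 5490.1223,
    "floorLevel_top": 2575.1220,
    "buildingType_plate": -3575.1281,
    "buildingType_plate_tower": -2739.2282,
    "buildingType_tower": -7074.6472,
}


def group_shifts(prefix):
    """Differences (statsmodels - sklearn) for all dummies of one category."""
    return [
        statsmodels_coef[name] - sklearn_coef[name]
        for name in sklearn_coef
        if name.startswith(prefix)
    ]


floor_shifts = group_shifts("floorLevel_")
building_shifts = group_shifts("buildingType_")

# Within each category the shift is (up to rounding) one constant.
floor_shift = sum(floor_shifts) / len(floor_shifts)
building_shift = sum(building_shifts) / len(building_shifts)

# Assumption: one active dummy per category in every row, so the intercept
# must drop by the sum of the two shifts to keep predictions the same.
implied_intercept = sklearn_intercept - floor_shift - building_shift

print("floorLevel shift:   ", round(floor_shift, 2))
print("buildingType shift: ", round(building_shift, 2))
print("implied intercept:  ", round(implied_intercept, 2))
```

The implied intercept rounds to the 2.43e+04 that statsmodels reports, which suggests both fits describe the same predictions and differ only in how they split a constant between the intercept and the dummies. Dropping one level per category, as drop_first=True does, removes that freedom.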
Next Door Engineer