
I'm using StandardScaler() and the lin_reg.coef_ attribute in the following context:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

for i in range(100):
    # new random split on each iteration
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=i)
    # fit the scaler on the training data only, then transform both splits
    scaler = StandardScaler().fit(x_train)
    x_train = scaler.transform(x_train)
    x_test = scaler.transform(x_test)
    lin_reg = LinearRegression().fit(x_train, y_train)

    # print the coefficients of the first two runs
    if i in (0, 1):
        print(lin_reg.coef_)

This leads to the following output:

[Image: Code Output]

So, as expected, coef_ returns the coefficients for the 22 different features I am passing into the linear regression. However, for the second output, some of the coefficients are far too large (e.g. 1.61e+14). I am fairly sure that the scaling with StandardScaler() works as it should. However, if I do not scale the training data before fitting the regression, I do not get these high coefficients. One important thing I should mention is that the last 13 features are binary, whereas the first 9 features are continuous (such as age). I suspect the problem is somehow related to this fact, although the first binary feature gets a properly computed coefficient (only the last 12 binary features have coefficients that are too large).


1 Answer


You should use standardization when the data come from a Gaussian distribution. Using StandardScaler() on binary data doesn't make any sense.

You should scale only the first 9 variables, and then pass all 22 into the linear regression, as in the sketch below.
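Here is a minimal sketch of that idea using scikit-learn's ColumnTransformer, assuming (as stated in the question) that the first 9 columns of x are continuous and the remaining 13 are binary:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Scale only the first 9 (continuous) columns; pass the binary
# columns through unchanged. The column indices are an assumption
# based on the feature layout described in the question.
ct = ColumnTransformer(
    [("scale", StandardScaler(), list(range(9)))],
    remainder="passthrough",
)

model = make_pipeline(ct, LinearRegression())
model.fit(x_train, y_train)
print(model.named_steps["linearregression"].coef_)

Wrapping the transformer and the regression in a single pipeline also guarantees that the scaler is fit on the training split only, so nothing leaks from the test set when you later call model.predict(x_test).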

See also: https://www.atoti.io/when-to-perform-a-feature-scaling/ and "Avoid scaling binary columns in scikit-learn StandardScaler".