Sklearn.pipeline producing incorrect result

Question

I am trying to construct a pipeline with a StandardScaler() and LogisticRegression(). I get different results when I code it with and without the pipeline. Here's my code without the pipeline:

clf_LR = linear_model.LogisticRegression()
scalar = StandardScaler()
X_train_std = scalar.fit_transform(X_train)
X_test_std = scalar.fit_transform(X_test)
clf_LR.fit(X_train_std, y_train)
print('Testing score without pipeline: ', clf_LR.score(X_test_std, y_test))

My code with pipeline:

pipe_LR = Pipeline([('scaler', StandardScaler()), 
                    ('classifier', linear_model.LogisticRegression())
                   ])
pipe_LR.fit(X_train, y_train)
print('Testing score with pipeline: ', pipe_LR.score(X_test, y_test))

Here is my result:

Testing score with pipeline:  0.821917808219178
Testing score without pipeline:  0.8767123287671232

While trying to debug the problem, it seems the data is being standardized. But the result with pipeline matches the result of training the model on my original X_train data (without applying StandardScaler()).

clf_LR_orig = linear_model.LogisticRegression()
clf_LR_orig.fit(X_train, y_train)
print('Testing score without Standardization: ', clf_LR_orig.score(X_test, y_test))

Testing score without Standardization:  0.821917808219178

Is there something I am missing in the construction of the pipeline? Thanks very much!

I am not sure, but I think the problem is in your first snippet. Try changing `X_test_std = scalar.fit_transform(X_test)` to use the `transform` method instead. You should never fit your transformers on your test data. Let me know if it works and I will describe it in more detail. — Szymon Bednorz, Aug 25 '20 at 17:10
Your test data is being standardized in the code without the pipeline, but not in the code with the pipeline. That caused the results to differ. Use the same kind of test data in both tests. — Muhammad Hamza Sabir, Aug 25 '20 at 17:27

Sanyam Lakhanpal · Answer 1 · 2020-08-27T07:55:52.727

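As Szymon Bednorz commented, you generally should not call fit_transform on the test data. Instead, call fit_transform(X_train) and then transform(X_test), so that the test set is scaled with the mean and standard deviation learned from the training set. That is what the pipeline does: pipe_LR.fit fits the scaler on X_train, and pipe_LR.score only transforms X_test. Your code without the pipeline refits the scaler on X_test, which is why the two scores differ. This approach works well when your training and test data come from the same distribution and X_train is larger than X_test.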
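To illustrate the point, here is a minimal sketch comparing the manual approach (fit on train, transform on test) with a pipeline. The question's dataset is not shown, so this assumes scikit-learn's built-in breast cancer data as a stand-in; the split, `random_state`, and `max_iter=1000` are also my own choices, not taken from the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the question's own dataset is not available.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Manual version: learn the scaling statistics from the training data only.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)  # transform, not fit_transform

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_std, y_train)
manual_score = clf.score(X_test_std, y_test)

# Pipeline version: fit() fits the scaler on X_train,
# score() only transforms X_test before scoring.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)

print('Manual score:  ', manual_score)
print('Pipeline score:', pipe_score)
```

With the scaler fitted only on the training data, both versions go through the same steps, so the two printed scores should agree.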

Also, you found while debugging that the pipeline gives the same accuracy as logistic regression fitted on the original X_train. That hints that X_train and X_test are already scaled, although I am not sure about this.
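One way to check that guess is to look at the per-column mean and standard deviation of the data. The sketch below uses a hypothetical helper, `looks_standardized`, and synthetic arrays as stand-ins (both are my own assumptions); you would pass your own X_train instead.

```python
import numpy as np

def looks_standardized(X, tol=1e-6):
    """Hypothetical helper: True if every column has mean ~0 and std ~1."""
    X = np.asarray(X, dtype=float)
    means_ok = np.allclose(X.mean(axis=0), 0.0, atol=tol)
    stds_ok = np.allclose(X.std(axis=0), 1.0, atol=tol)
    return bool(means_ok and stds_ok)

# Synthetic stand-in data (assumption), not the question's dataset.
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(looks_standardized(raw))     # False: column means are far from 0
print(looks_standardized(scaled))  # True: columns were standardized above
```

If this returns True for X_train, applying StandardScaler() again changes nothing, which would explain why the pipeline score matches the score without standardization.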
