Standardize data with K-Fold cross validation

Question

I'm using StratifiedKFold, so my code looks like this:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def train_model(X, y, X_test, folds, model):
    scores = []
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
        X_train, X_valid = X[train_index], X[valid_index]
        y_train, y_valid = y[train_index], y[valid_index]
        model.fit(X_train, y_train)
        y_pred_valid = model.predict(X_valid).reshape(-1,)
        scores.append(roc_auc_score(y_valid, y_pred_valid))
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))

folds = StratifiedKFold(10, shuffle=True, random_state=0)
lr = LogisticRegression(class_weight='balanced', penalty='l1', C=0.1, solver='liblinear')
# originally passed `repeted_folds`, which is never defined; `folds` is assumed here
train_model(X_train, y_train, X_test, folds, lr)

Now, before training the model, I want to standardize the data. Which is the correct way?

1) Standardizing before calling the train_model function:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

2) Standardizing inside the function, like this:

from sklearn.preprocessing import StandardScaler

def train_model(X, y, X_test, folds, model):
    scores = []
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
        X_train, X_valid = X[train_index], X[valid_index]
        y_train, y_valid = y[train_index], y[valid_index]
        # fit the scaler on this fold's training rows only
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        # was misspelled `X_vaid`, which left X_valid unscaled
        X_valid = scaler.transform(X_valid)
        # use a separate name so X_test is not re-scaled on every fold
        X_test_scaled = scaler.transform(X_test)
        model.fit(X_train, y_train)
        y_pred_valid = model.predict(X_valid).reshape(-1,)
        scores.append(roc_auc_score(y_valid, y_pred_valid))
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))

As far as I know, the second option does not leak data. So which way is correct if I'm not using a pipeline? And how do I use a pipeline if I want to use cross-validation?

score 3 · Accepted Answer · answered Nov 19 '19 at 17:34


Indeed, the second option is better because the scaler does not see the values of X_valid when it scales X_train.

Now if you were to use a pipeline, you can do:

from sklearn.pipeline import make_pipeline

def train_model(X,y,X_test,folds,model):
    pipeline = make_pipeline(StandardScaler(), model)
    ...

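Then use pipeline instead of model. On every fit call, the pipeline fits the scaler on the data it is given and trains the model on the scaled result; on every predict call, it applies the already-fitted scaler before predicting.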
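As a rough sketch of how the elided part of the function could be filled in, the loop below fits the pipeline once per fold. The synthetic data from make_classification, the dropped X_test argument, the returned score array, and the use of predict_proba for the ROC AUC score are all assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_model(X, y, folds, model):
    # Scaler and model are wrapped together, so each fit() learns the
    # scaling statistics from that fold's training rows only.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = []
    for train_index, valid_index in folds.split(X, y):
        pipeline.fit(X[train_index], y[train_index])
        # Assumption for this sketch: score with class-1 probabilities
        # rather than hard labels.
        y_score = pipeline.predict_proba(X[valid_index])[:, 1]
        scores.append(roc_auc_score(y[valid_index], y_score))
    return np.array(scores)


# Synthetic stand-in for the asker's training data (hypothetical).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
folds = StratifiedKFold(10, shuffle=True, random_state=0)
lr = LogisticRegression(class_weight='balanced', penalty='l1', C=0.1, solver='liblinear')

scores = train_model(X, y, folds, lr)
print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(scores.mean(), scores.std()))
```

Because the scaler lives inside the pipeline, there is no separate scaling step to forget or to apply to the wrong array.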

Note that you can also use the cross_val_score function from scikit-learn, with the parameter scoring='roc_auc'.
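A minimal sketch of the cross_val_score route, assuming synthetic data in place of the real training set (make_classification and the variable names are illustrative only). The splitter and model settings are copied from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the whole training set (hypothetical).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

folds = StratifiedKFold(10, shuffle=True, random_state=0)
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', penalty='l1', C=0.1, solver='liblinear'),
)

# cross_val_score clones the pipeline and refits it on each training split,
# so the scaler never sees the held-out rows of that split.
cv_scores = cross_val_score(pipeline, X, y, cv=folds, scoring='roc_auc')
print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(cv_scores.mean(), cv_scores.std()))
```

This replaces the explicit for loop over folds.split(X, y); the splitting, fitting, and scoring all happen inside cross_val_score.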

answered Nov 19 '19 at 17:34 by Horace

So if I want to use cross_val_score, I first create folds = StratifiedKFold(10,shuffle=True,random_state=0), then make a pipeline with StandardScaler and the model, and then call cross_val_score(pipeline,X,y,cv=folds,scoring='roc_auc'). Here X and y mean the whole training data, right? – Utsav Patel Nov 19 '19 at 17:53
Yes, the whole training data. It takes care of the splits for you, according to the parameter you pass to `cv`, in your case a CV splitter. – Horace Nov 19 '19 at 17:56
So if I am using cross_val_score, then I don't have to write the for loop with folds.split(X,y); both are the same thing, right? – Utsav Patel Nov 19 '19 at 17:58
@Utsav Patel Sorry for deleting my answer; I was afraid it wasn't clear. You can find what I actually meant here, and I hope you find it useful: https://stats.stackexchange.com/questions/27627/normalization-prior-to-cross-validation – 4.Pi.n Nov 19 '19 at 18:21

score 0 · Answer 2 · edited Aug 28 '23 at 22:35


IMO, if your data set is large it probably doesn't matter too much (although with k-fold this may not be the case), but since you can, it's better to standardize within your cross validation (k-fold), i.e. option 2.

Also, see this for more information on overfitting in cross validation.
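One way to check how much the choice matters on a given data set is to score both options with the same folds. The sketch below does that on synthetic data; the data, sample size, and model settings are assumptions for illustration, and the size of the gap will depend on the data at hand.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
folds = StratifiedKFold(5, shuffle=True, random_state=0)
model = LogisticRegression(solver='liblinear')

# Option 1: scaler fitted once on all rows, so every validation split
# contributes to the mean and variance used to scale its training split.
X_scaled_all = StandardScaler().fit_transform(X)
scores_outside = cross_val_score(model, X_scaled_all, y, cv=folds, scoring='roc_auc')

# Option 2: scaler refitted inside each fold via a pipeline.
scores_inside = cross_val_score(
    make_pipeline(StandardScaler(), model), X, y, cv=folds, scoring='roc_auc'
)

print('scaled before CV: {0:.4f}'.format(scores_outside.mean()))
print('scaled inside CV: {0:.4f}'.format(scores_inside.mean()))
```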

edited Aug 28 '23 at 22:35 by desertnaut
answered Nov 19 '19 at 17:27 by nepdavis
