
I am writing a function in which the best model is chosen via k-fold cross-validation. Inside the function, I have a pipeline that

  1. scales the data
  2. searches for the optimal hyperparameters for a decision tree regressor

Then I want to use the model to predict some target values. To do so, I have to apply the same scaling that was applied during the grid search.

Does the pipeline transform the data I want to predict on using the scaler that was fitted on the training data, even though I do not specify this explicitly? I've been looking through the documentation, and it seems that it does, but I'm not sure, since this is the first time I have used pipelines.

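import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

def build_model(data, target, param_grid):
    # compute feature range (data is expected to be a pandas DataFrame)
    features = data.keys()
    feature_range = dict()
    maxs = data.max(axis=0)
    mins = data.min(axis=0)
    for feature in features:
        # compare strings with !=, not with the identity operator
        if feature != 'metric':
            feature_range[feature] = {'max': maxs[feature], 'min': mins[feature]}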

    # initialise the k-fold cross validator
    no_split = 10
    kf = KFold(n_splits=no_split, shuffle=True, random_state=42)
    # create the pipeline
    pipe = make_pipeline(MinMaxScaler(), 
                         GridSearchCV(
                             estimator=DecisionTreeRegressor(), 
                             param_grid=param_grid, 
                             n_jobs=-1, 
                             cv=kf, 
                             refit=True))
    pipe.fit(data, target)

    return pipe, feature_range

max_depth = np.arange(1,10)
min_samples_split = np.arange(2,10)
min_samples_leaf = np.arange(2,10) 
param_grid = {'max_depth': max_depth, 
              'min_samples_split': min_samples_split, 
              'min_samples_leaf': min_samples_leaf}
pipe, feature_range = build_model(data=data, target=target, param_grid=param_grid)

# could that be correct?
predictions = pipe.predict(test_data)

EDIT: I found in the documentation for the preprocessing module that each preprocessing tool has an API that can

compute the [transformation] on a training set so as to be able to reapply the same transformation on the testing set

If that is the case, the transformation is presumably stored internally, so the answer may be yes.
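A minimal sketch of what "stored internally" means for MinMaxScaler, assuming toy arrays (X_train and X_test are hypothetical and not taken from the code above): fit records the per-feature minimum and maximum of the training set, and transform reuses them without refitting.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, for illustration only).
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)  # learns the per-feature min and max from the training set

# The learnt statistics are kept on the fitted object.
train_min = scaler.data_min_  # per-feature minimum of X_train
train_max = scaler.data_max_  # per-feature maximum of X_train

# transform() reuses the stored training statistics; it does not refit.
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```

The second test feature (40.0) lies above the training maximum (30.0), so it scales to 1.5 rather than to a value in [0, 1]. That is what one would expect if the training range, not the test range, is being applied.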

Mattia Paterna

1 Answer


For every step except the last, the sklearn pipeline calls fit_transform (or fit followed by transform, if the step has no fit_transform method). So in your pipeline, the scaling step transforms the data before it is passed to GridSearchCV.

Documentation here.
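As a rough illustration of the fitting behaviour described above, the sketch below builds a pipeline similar to the one in the question on synthetic data (the arrays and the small parameter grid are assumptions made for the example) and then inspects the fitted steps.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data (hypothetical, for illustration only).
rng = np.random.RandomState(0)
X_train = rng.uniform(0, 100, size=(60, 3))
y_train = X_train[:, 0] + 0.5 * X_train[:, 1]

pipe = make_pipeline(
    MinMaxScaler(),
    GridSearchCV(
        estimator=DecisionTreeRegressor(random_state=0),
        param_grid={"max_depth": [2, 4]},
        cv=KFold(n_splits=3, shuffle=True, random_state=42),
    ),
)
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
scaler = pipe.named_steps["minmaxscaler"]
search = pipe.named_steps["gridsearchcv"]

# The scaler inside the pipeline was fitted on the training data ...
scaler_saw_training_data = np.allclose(scaler.data_min_, X_train.min(axis=0))

# ... and the grid search was fitted on the scaler's output.
best_params = search.best_params_
print(scaler_saw_training_data, best_params)
```

Note that in this arrangement the scaler is fitted once on all of the training data before the cross-validation splits are made. Wrapping the whole pipeline in GridSearchCV instead would refit the scaler inside each fold, which is a different design from the one in the question.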

roschach
amanbirs
  • I understand that, but can you confirm that the learnt transformation on training data will be applied on test data? If so, the problem is solved :) – Mattia Paterna Nov 21 '17 at 11:04
  • Yes! It will be applied when you call the predict function for the pipeline. – amanbirs Nov 21 '17 at 11:11
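One way to check this for yourself is to compare the pipeline's predictions with the result of applying the fitted steps by hand. The sketch below assumes synthetic data and a reduced parameter grid, both hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic train and test data (hypothetical, for illustration only).
rng = np.random.RandomState(1)
X_train = rng.uniform(0, 100, size=(60, 2))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1]
X_test = rng.uniform(0, 100, size=(10, 2))

pipe = make_pipeline(
    MinMaxScaler(),
    GridSearchCV(DecisionTreeRegressor(random_state=0), {"max_depth": [2, 3]}, cv=3),
)
pipe.fit(X_train, y_train)

# Prediction through the pipeline: raw test data goes in.
via_pipeline = pipe.predict(X_test)

# The same prediction done by hand: scale the test data with the scaler
# fitted on the training data, then predict with the refitted grid search.
scaler = pipe.steps[0][1]
search = pipe.steps[1][1]
by_hand = search.predict(scaler.transform(X_test))

same = np.allclose(via_pipeline, by_hand)
print(same)
```

If the two arrays match, the pipeline is applying the training-time scaling to the test data before the final estimator predicts, without any explicit transform call on your side.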