
I'm currently using GridSearchCV with TimeSeriesSplit like this, so that my data is split into 5 CV folds:

import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# `data` is a DataFrame loaded elsewhere; the first 8 columns are features, the 9th is the target
X = data.iloc[:, 0:8]
y = data.iloc[:, 8:9]

SVR_parameters = [{'kernel': ['rbf'],
                   'gamma': [.01, .001, 1],
                   'C': [1, 100]}]

gsc = GridSearchCV(SVR(), param_grid=SVR_parameters, scoring='neg_mean_squared_error',
                   cv=TimeSeriesSplit(n_splits=5).split(X), verbose=10, n_jobs=-1, refit=True)
gsc.fit(X, y)
gsc_dataframe = pd.DataFrame(gsc.cv_results_)

My understanding is that when using a scaler, you want to fit it on the training set only and use that fitted scaler object to transform the test set, so as to prevent data leakage. Basically something like this:

scaler_X = StandardScaler()
scaler_y = StandardScaler()
scaler_X.fit(X_train)
scaler_y.fit(y_train)
X_train, X_test = scaler_X.transform(X_train), scaler_X.transform(X_test)
y_train, y_test = scaler_y.transform(y_train), scaler_y.transform(y_test)

My question is: if I perform this type of scaling operation, how would I still get GridSearchCV to split over my entire data set? If I just replace the X variable in the gsc object with X_train, it would leave out X_test, right?

I'm wondering if there is a proper way to scale the data while still using all of it in GridSearchCV.
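
To make the concern concrete, a leakage-free version done entirely by hand would look something like this sketch (the SVR settings below are just one arbitrary candidate from the grid above, not a recommendation):

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Illustrative sketch only: refit the scaler inside every fold so it never
# sees that fold's validation rows.
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

    scaler = StandardScaler().fit(X_tr)   # fit on the training part of the fold only
    model = SVR(kernel='rbf', C=100, gamma=.01)
    model.fit(scaler.transform(X_tr), y_tr.values.ravel())
    print(mean_squared_error(y_val, model.predict(scaler.transform(X_val))))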

I hope I explained that clearly enough. Please let me know if you need anything clarified.


Update:

Adding my full code to help explain better:

import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = data.iloc[:, 0:8]
y = data.iloc[:, 8:9]

# shuffle=False keeps the time ordering intact
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, shuffle=False)

test_index = X_test.index.values.tolist()

scaler_x = StandardScaler()
scaler_y = StandardScaler()
scaler_x.fit(X_train)
scaler_y.fit(y_train)

X_train, X_test = scaler_x.transform(X_train), scaler_x.transform(X_test)
y_train, y_test = scaler_y.transform(y_train), scaler_y.transform(y_test)

SVR_parameters = [{'kernel': ['rbf'],
                   'gamma': [.1, .01, .001],
                   'C': [100, 500, 1000]}]

gsc = GridSearchCV(SVR(), param_grid=SVR_parameters, scoring='neg_mean_squared_error',
                   cv=TimeSeriesSplit(n_splits=5).split(X_train), verbose=10, n_jobs=-1, refit=True)

gsc.fit(X_train, y_train)
gsc_dataframe = pd.DataFrame(gsc.cv_results_)
y_pred = gsc.predict(X_test)
# reshape to 2-D because the scaler expects the column shape it was fitted with
y_pred = scaler_y.inverse_transform(y_pred.reshape(-1, 1))
y_test = scaler_y.inverse_transform(y_test)
mae = round(metrics.mean_absolute_error(y_test, y_pred), 2)
mse = round(metrics.mean_squared_error(y_test, y_pred), 2)
y_df = pd.DataFrame(index=pd.to_datetime(test_index))
y_pred = y_pred.reshape(len(y_pred), )
y_test = y_test.reshape(len(y_test), )
y_df['Model'] = y_pred
y_df['Actual'] = y_test
y_df.plot(title='{}'.format(gsc.cv_results_['params'][gsc.best_index_]))

1 Answer

Use a pipeline (https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html):

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# inside each CV split, the scaler is fit on the training fold only,
# then applied to both the training and the validation fold
pipe = Pipeline([
        ('scale', StandardScaler()),
        ('clf', SVR())])

param_grid = dict(clf__gamma=[.01, .001, 1],
                  clf__C=[1, 100],
                  clf__kernel=['rbf', 'linear'])

gsc = GridSearchCV(pipe, param_grid=param_grid, scoring='neg_mean_squared_error',
                   cv=TimeSeriesSplit(n_splits=5).split(X), verbose=10, n_jobs=-1, refit=True)

gsc.fit(X, y)
print(gsc.best_estimator_)

See also this post for the behind-the-scenes steps: Apply StandardScaler in Pipeline in scikit-learn (sklearn)
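
Behind the scenes, what the pipeline does inside each CV split is roughly equivalent to this sketch (illustrative only, not GridSearchCV's actual source; `params` stands in for one candidate combination from the grid):

from sklearn.metrics import mean_squared_error

params = {'kernel': 'rbf', 'gamma': .01, 'C': 1}   # one candidate from the grid

for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    scaler = StandardScaler().fit(X.iloc[train_idx])   # fit on the training fold only
    clf = SVR(**params).fit(scaler.transform(X.iloc[train_idx]),
                            y.iloc[train_idx].values.ravel())
    y_val_pred = clf.predict(scaler.transform(X.iloc[val_idx]))
    fold_score = -mean_squared_error(y.iloc[val_idx], y_val_pred)  # neg_mean_squared_error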

  • I actually posted a response on that post without realizing you were the one who answered my post. My follow-up question would be: if I used a pipeline like you suggested, would I lose the ability to have separate scaler objects to fit the X and y data sets separately, to avoid "data leakage" the way I've outlined in my 2nd code block? I'm not really even sure if that extra step is necessary @serafeim – novawaly Jun 24 '19 at 12:52
  • Also, if I use the Pipeline, is it possible to then access the inverse_transform method of the StandardScaler to get my data back to its original format, or can you only use the inverse_transform method of the pipeline, which will undo all of the transformation steps? – novawaly Jun 24 '19 at 13:14
  • For the first question, you do not need to apply the `StandardScaler` on `y`, since `y` is the variable that you want to predict. Use the code as it is in my answer and there is no problem of data leakage (I have explained how it works step by step here: https://stackoverflow.com/a/51465479/5025009) – seralouk Jun 24 '19 at 13:54
  • So if my X is 8 columns of numbers with differing ranges and I apply StandardScaler, it will go through each column individually, calculate a mean/standard deviation for each, and then predict a value for y during the GridSearchCV step. If y is always a number between 60-100, the predicted value will always be in scaled terms (a number between -1 and 1), right? So, when I apply an inverse_transform on that predicted value to visualize it in the 60-100 scale of the original y, wouldn't I need the y_scaler to do that? – novawaly Jun 24 '19 at 14:09
  • I added my actual code block because that question about why I need the y_scaler was probably too convoluted. Am I doing something wrong there with my logic? Essentially, I'd just like it in the same scale so I can plot it more easily. – novawaly Jun 24 '19 at 14:17
  • `If y is always a number between 60-100, the predicted value will always be in scaled terms (a number between -1 and 1), right` - No, you use the scaled data to predict y. You do not change `y` at all; you only bring the independent variables (X) to a similar scale. – seralouk Jun 24 '19 at 14:56
  • You're right. I just removed it. I'm still a little confused as to what's happening. So: I fit a scaler object on X_train. I use that instance to transform X_train and X_test. I fit a model on X_train and y_train. I then predict X_test. If we take simple linear regression and fit a line through the scaled X_train data, it'll be a line somewhere between -1 and 1. If I then predict an X_test value, how is it that I'm getting an output that's still in the same scale as my original y, which wasn't touched? Wouldn't that line of best fit produce something around -1 and 1 if I'm predicting on scaled X_test? – novawaly Jun 24 '19 at 17:30
  • Again, you do not need to worry about this. This is done internally by GridSearchCV. – seralouk Jun 24 '19 at 17:43
  • Understood - the question was just for my own understanding of the scaling process in general and what's happening behind the scenes. I'm missing that last step. – novawaly Jun 24 '19 at 17:45
  • The scaled y can be transformed back to y_original, but you do not need this. The output of `gsc.best_score_` will give you the best `neg_mean_squared_error` achieved in the GridSearch. Consider accepting my answer, and have a look here: https://stackoverflow.com/a/52455977/5025009 – seralouk Jun 24 '19 at 17:48
  • I think I'm just asking my question incorrectly. Assume we're not doing any GridSearch: just splitting the data into train/test splits, scaling the data, fitting the model, and predicting the test set. I'm wondering why I DON'T have to do the inverse_transform operation on my model.predict(X_test) results if I've scaled the X_test data. Shouldn't that return results on the scale of X_test (similar to the user who asked the question in your last link), requiring me to use inverse_transform? I feel like I'm missing a step somewhere or my understanding is flawed. – novawaly Jun 24 '19 at 18:03
  • Yes, you are right. You need to scale the prediction back to the original scale in this case (if you predict using the scaled data). However, if you use the GridSearchCV method, this is done automatically and internally (a sketch of automating the y-scaling follows after these comments). – seralouk Jun 24 '19 at 18:09
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195477/discussion-between-novawaly-and-serafeim). – novawaly Jun 24 '19 at 18:10
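
If you did want y scaled as well, a minimal sketch using scikit-learn's TransformedTargetRegressor (not used in the thread above) would look like this: it standardizes y at fit time and inverse-transforms predictions automatically, so no manual scaler_y.inverse_transform() is needed. X, y here are the unscaled data from the question.

# Minimal sketch, assuming you want y standardized as well.
# TransformedTargetRegressor fits its transformer on the training y and
# inverse-transforms predict() output back to the original y scale.
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, shuffle=False)

pipe = Pipeline([('scale', StandardScaler()), ('clf', SVR())])
model = TransformedTargetRegressor(regressor=pipe, transformer=StandardScaler())

param_grid = {'regressor__clf__gamma': [.1, .01, .001],
              'regressor__clf__C': [100, 500, 1000]}

gsc = GridSearchCV(model, param_grid=param_grid, scoring='neg_mean_squared_error',
                   cv=TimeSeriesSplit(n_splits=5), refit=True)
gsc.fit(X_train, y_train.values.ravel())   # raw, unscaled data goes in
y_pred = gsc.predict(X_test)               # already back on the original scale of y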