I'm currently using GridSearchCV and TimeSeriesSplit like this, so that my data is split into 5 CV splits:
X = data.iloc[:, 0:8]
y = data.iloc[:, 8:9]
SVR_parameters = [{'kernel': ['rbf'],
                   'gamma': [.01, .001, 1],
                   'C': [1, 100]}]
gsc = GridSearchCV(SVR(), param_grid=SVR_parameters, scoring='neg_mean_squared_error',
                   cv=TimeSeriesSplit(n_splits=5).split(X), verbose=10, n_jobs=-1, refit=True)
gsc.fit(X, y.values.ravel())  # SVR expects a 1-D target, y is a one-column DataFrame
gsc_dataframe = pd.DataFrame(gsc.cv_results_)
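To make the splitting behaviour concrete, here is a minimal sketch of what TimeSeriesSplit(n_splits=5) produces. The array X_demo is a hypothetical stand-in for the real data (12 time-ordered rows, 8 feature columns), not the actual data set.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for the real data: 12 time-ordered rows, 8 features.
X_demo = np.arange(12 * 8).reshape(12, 8)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X_demo))

for fold, (train_idx, val_idx) in enumerate(splits):
    # In every fold the training rows come strictly before the validation rows,
    # and the training window grows from one fold to the next.
    print(f"fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

Note that GridSearchCV also accepts the splitter object itself, e.g. cv=TimeSeriesSplit(n_splits=5), so calling .split(X) up front is optional.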
My understanding is that when using a scaler, you want to fit it on the training set only and then transform the test set with that same fitted scaler object, so as to prevent data leakage. So basically something like this:
scaler_X = StandardScaler()
scaler_y = StandardScaler()
scaler_X.fit(X_train)
scaler_y.fit(y_train)
X_train, X_test = scaler_X.transform(X_train), scaler_X.transform(X_test)
y_train, y_test = scaler_y.transform(y_train), scaler_y.transform(y_test)
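As a small illustration of the fit-on-train-only idea above, the sketch below uses hypothetical synthetic data (X_all, split 75/25 without shuffling) and checks that the scaler's statistics come from the training rows alone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 100 time-ordered rows, 3 features.
rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_tr, X_te = X_all[:75], X_all[75:]  # no shuffling, test rows come last

# Fit on the training rows only, then reuse the fitted scaler on the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# The learned mean is the training mean, so nothing from X_te leaks in.
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))
# The scaled test set is generally not exactly zero-mean, which is expected.
print(X_te_s.mean(axis=0))
```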
My question is:
If I perform this type of scaling operation, how would I still get GridSearchCV to split over my entire data set? If I just replace the X variable in the gsc object with X_train, it would leave out X_test, right?
I'm wondering if there is a proper way to scale the data while still using all of it in GridSearchCV.
I hope I explained that clearly enough. Please let me know if you need anything clarified.
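To make the question concrete, here is a minimal sketch of one possible way to get the scaling done inside each CV split: wrap the scaler and the SVR in a scikit-learn Pipeline, and wrap that in a TransformedTargetRegressor so the target is scaled per fold as well. This is an assumption about what might fit, not a confirmed solution. X_demo and y_demo are hypothetical synthetic data, and the step names "scale" and "svr" are made up for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical synthetic data standing in for the real X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 8))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.1, size=120)

# The feature scaler is refit on the training part of every CV split,
# because it is a pipeline step rather than a preprocessing step done up front.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svr", SVR(kernel="rbf")),
])

# Scales y on the training part of each split and inverse-transforms predictions.
model = TransformedTargetRegressor(regressor=pipe, transformer=StandardScaler())

# Nested parameter names: <TransformedTargetRegressor attr>__<step name>__<param>.
param_grid = {
    "regressor__svr__gamma": [0.01, 0.001, 1],
    "regressor__svr__C": [1, 100],
}

search = GridSearchCV(
    model,
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
    refit=True,
)
search.fit(X_demo, y_demo)  # the whole data set goes in, splitting happens inside
print(search.best_params_)
```

With this arrangement the scores in cv_results_ are computed on the original scale of y, since predictions are inverse-transformed before scoring.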
Update:
Adding the full code to help explain better:
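import pandas as pd
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = data.iloc[:, 0:8]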
y = data.iloc[:, 8:9]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, shuffle=False)
test_index = X_test.index.values.tolist()
scaler_x = StandardScaler()
scaler_y = StandardScaler()
scaler_x.fit(X_train)
scaler_y.fit(y_train)
X_train, X_test = scaler_x.transform(X_train), scaler_x.transform(X_test)
y_train, y_test = scaler_y.transform(y_train), scaler_y.transform(y_test)
SVR_parameters = [{'kernel': ['rbf'],
                   'gamma': [.1, .01, .001],
                   'C': [100, 500, 1000]}]
gsc = GridSearchCV(SVR(), param_grid=SVR_parameters, scoring='neg_mean_squared_error',
                   cv=TimeSeriesSplit(n_splits=5).split(X_train), verbose=10, n_jobs=-1, refit=True)
gsc.fit(X_train, y_train.ravel())  # SVR expects a 1-D target
gsc_dataframe = pd.DataFrame(gsc.cv_results_)
y_pred = gsc.predict(X_test)
# inverse_transform expects a 2-D array, so reshape the 1-D predictions first
y_pred = scaler_y.inverse_transform(y_pred.reshape(-1, 1))
y_test = scaler_y.inverse_transform(y_test)
mae = round(metrics.mean_absolute_error(y_test,y_pred),2)
mse = round(metrics.mean_squared_error(y_test, y_pred),2)
y_df = pd.DataFrame(index=pd.to_datetime(test_index))
y_pred = y_pred.reshape(len(y_pred), )
y_test = y_test.reshape(len(y_test), )
y_df['Model'] = y_pred
y_df['Actual'] = y_test
y_df.plot(title='{}'.format(gsc.cv_results_['params'][gsc.best_index_]))