
I trained a set of LinearRegression models using the following GridSearchCV setup:

import pickle

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

MAX_COLUMNS = list(range(2, len(house_df.columns)))

X = house_df.drop(columns=['SalePrice'])
y = house_df.loc[:, 'SalePrice']

column_list = MAX_COLUMNS

# Box-Cox transform the target
reg_strategy = TransformedTargetRegressor()
bcox_transformer = PowerTransformer(method='box-cox')


model_pipeline = Pipeline([("std_scaler", StandardScaler()),
                           ('feature_selector', SelectKBest()),
                           ('regress', reg_strategy)])


parameter_grid = [{'feature_selector__k' : column_list,
                   'feature_selector__score_func' : [f_regression, mutual_info_regression],
                   'regress__regressor' : [LinearRegression()],
                   'regress__regressor__fit_intercept' : [True],
                   'regress__transformer' : [None, bcox_transformer]}]


score_types = {'MSE' : 'neg_mean_squared_error', 'r2' : 'r2'}

gs = GridSearchCV(estimator=model_pipeline, param_grid=parameter_grid, scoring=score_types, refit='MSE', cv=5, n_jobs=5, verbose=1)

gs.fit(X, y)

PATH = './datasets/processed_data/'
gridsearch_result_filename = 'pfY_np10_nt2_rfS_ct0_8_st1_orY_ccY_LR1_GS.pkl'
full_path = PATH + gridsearch_result_filename
with open(full_path, 'wb') as file:
    pickle.dump(gs, file)

I then load the trained GridSearchCV object and can make predictions using the best estimator as follows:

with open(MODEL_PATH, 'rb') as file:
    gs_results = pickle.load(file)


predictions = gs_results.predict(test_df)

The problem I am facing is that since the Box-Cox transform was applied during the grid search, all of my predictions are in the Box-Cox-transformed target's domain (huge values).

I need to use the PowerTransformer's inverse_transform() method on my predictions, but I am not sure how to access it.

I can get the entire pipeline for the best estimator like this:

gs_results.best_estimator_

I can then access the TransformedTargetRegressor inside the pipeline like this:

gs_results.best_estimator_.named_steps['regress']

Taking a step further, we get all the way to the PowerTransformer inside the TransformedTargetRegressor like this:

gs_results.best_estimator_.named_steps['regress'].transformer

After making it here, I foolishly thought I had made it to where I needed to be, and simply needed to use the PowerTransformer's inverse_transform() method to get predictions back in the original units. However, much to my disappointment, an error is thrown:

NotFittedError: This PowerTransformer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

The error seems pretty clear, telling me I cannot use the inverse_transform method because the PowerTransformer has not been fit.

This is where I am stumped. It doesn't make sense to say the PowerTransformer has not been fit, when clearly it was fit during the GridSearch process.

This makes me think I am simply accessing the PowerTransformer incorrectly, which is my current question.

Based on the setup above, does anyone know the correct way to take the inverse transform of my predictions so they are in the original units rather than the Box-Cox distribution's units?

I have been banging my head against the wall for this and have searched all over for a similar question. Thank you so much in advance!

-Braden

Braden Anderson

1 Answer


Much like here, the attribute transformer is the unfitted initialization attribute; you need the fitted transformer_ attribute.
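For example, using the step name 'regress' from your pipeline definition (a minimal sketch, assuming the pickled search is loaded into gs_results as above and that the best estimator actually selected the Box-Cox transformer rather than None):

# Pull the fitted Box-Cox transformer out of the best pipeline.
fitted_ttr = gs_results.best_estimator_.named_steps['regress']
fitted_pt = fitted_ttr.transformer_   # fitted clone; .transformer is only the unfitted template

# PowerTransformer works on 2D arrays, so reshape before (inverse-)transforming.
y_bc = fitted_pt.transform(y.to_numpy().reshape(-1, 1))
y_back = fitted_pt.inverse_transform(y_bc)   # round-trips back to the original units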

However, I'm not sure why predict doesn't already do what you want; the documentation for TransformedTargetRegressor.predict says

Predict using the base regressor, applying inverse.

The regressor is used to predict and the inverse_func or inverse_transform is applied before returning the prediction.
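In other words, with the setup in the question, gs_results.predict should already return values in the original SalePrice units. A quick way to check (a sketch reusing the question's variable names, and assuming a scikit-learn version that supports pipeline slicing) is to compare against the inner regressor's raw output:

# Predictions from the search: TransformedTargetRegressor applies the fitted
# Box-Cox inverse internally, so these should already be in SalePrice units.
preds = gs_results.predict(test_df)

# Raw predictions in the Box-Cox domain, for comparison.
pipe = gs_results.best_estimator_
ttr = pipe.named_steps['regress']
features = pipe[:-1].transform(test_df)        # apply std_scaler + feature_selector only
raw_preds = ttr.regressor_.predict(features)   # still in the transformed domain
# ttr.transformer_.inverse_transform(raw_preds.reshape(-1, 1)) should match preds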

Ben Reiniger
  • Thank you for your response. I see exactly what you're saying. When I make predictions on the training data, the inverse is clearly applied automatically and the predictions are reasonable, like this: ```gs_results.predict(X_train)``` but making predictions on the test data yields predicted values 3 or 4 orders of magnitude larger. Initially when I saw the large predictions, I assumed there must be some inverse transform needed. Since the inverse transform is automatically applied like you say (and I can see on my training data), I am now more unsure what the real issue is. – Braden Anderson Jun 18 '21 at 19:37
  • As it turns out, I believe the problem was not at all as I suspected in my original post. I think my model had just become unbelievably overfit and the outputs threw me off. Not sure if the correct thing is to delete this post, since the problem ended up being not as I described it above. Thank you again for taking the time to respond – Braden Anderson Jun 23 '21 at 06:16