
I'm trying to apply two transformations to my target variable: first a log(y+1) transform and then min-max scaling. I have been trying to use the TransformedTargetRegressor class available in scikit-learn together with these two functions:

import numpy as np

# min and max learned during the forward transform; reused by the inverse
tmin = 0
tmax = 0

def target_logtransform(target):
    target_ = target.copy()
    global tmin
    global tmax
    tmin = np.min(target_)
    tmax = np.max(target_)
    target_ = (target_ - tmin) / (tmax - tmin)
    target_ = np.log(target_ + 1)
    return target_

def target_inverselog(target):
    global tmin
    global tmax
    target_ = target.copy()
    target_ = np.exp(target_) - 1
    target_ = (target_ * (tmax - tmin)) + tmin
    return target_

With the regressor as follows:

# use time series split
tscv = TimeSeriesSplit(n_splits=4)
# scaler for features
col_transform = ColumnTransformer(transformers=[('num', MinMaxScaler(), [2, 3, 4, 5])],
                                  remainder='passthrough')
# model to be fit
m = LGBMRegressor(random_state=random_state, **params)
# set pipeline
pipeline = Pipeline(steps=[('prep', col_transform), ('m', m)])
# set TransformedTargetRegressor for target transformation
model = TransformedTargetRegressor(regressor=pipeline, func=target_logtransform,
                                   inverse_func=target_inverselog, check_inverse=False)
score = -cross_val_score(model, X_train, y_train, cv=tscv, scoring=my_scorer).mean()

I'm confused about how to use two transformations on the target, given that the min and max computed on the training data need to be the ones used during cross-validation.


1 Answer

The complete answer shows two equivalent ways of solving this. Skip to the TL;DR for the solution.


Step 1: Set up the problem

Let's replace the target_logtransform and target_inverselog functions. NumPy and scikit-learn already provide the pieces for both: np.log1p/np.expm1 wrapped in a FunctionTransformer, plus MinMaxScaler:

import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import MinMaxScaler

log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
scaler = MinMaxScaler()

We could do this manually by rescaling our target y; we'll do that first as a sanity check to compare against later:

# Initialize some data for reproducing results:
X = np.array([-0.916,-0.916,-0.836,-0.768,-0.700,-0.608,-0.528,-0.472,-0.404,-0.300,-0.184,-0.0840,0.0480,0.168,0.328,0.468,0.640,0.760,0.872]).reshape(-1, 1)
y = np.array([0.899,0.899,0.895,0.871,0.827,0.747,0.607,0.479,0.339,0.167,0.00294,-0.0971,-0.181,-0.233,-0.309,-0.333,-0.333,-0.321,-0.301]).reshape(-1, 1)

# Manually do this in two steps
y_log = log_transformer.fit_transform(y)
y_log_scaled = scaler.fit_transform(y_log)
print(y_log_scaled)

Output:

[[1.        ]
 [1.        ]
 [0.9979847 ]
 [0.98580284]
 ...
 [0.03378575]
 [0.        ]
 [0.        ]
 [0.01704215]
 [0.04478737]]

Step 2: Defining TwoTransformers to log then scale

Let's define a TwoTransformers class extending scikit-learn's TransformerMixin and BaseEstimator classes, and implement this object's fit_transform and inverse_transform methods. The first will look similar to our manual approach, but we can easily define the operations for the inverse:

from sklearn.base import TransformerMixin, BaseEstimator

class TwoTransformers(TransformerMixin, BaseEstimator):
    def fit_transform(self, y):
        # Fit both steps on y: log1p first, then min-max scale the result.
        self.log_transformer = FunctionTransformer(
            func=np.log1p,
            inverse_func=np.expm1,
        )
        self.scaler = MinMaxScaler()

        y_log = self.log_transformer.fit_transform(y)
        y_log_scaled = self.scaler.fit_transform(y_log)
        return y_log_scaled

    def inverse_transform(self, y):
        # Undo the steps in reverse order: unscale, then expm1.
        y_unscaled = self.scaler.inverse_transform(y)
        y_unscaled_unlog = self.log_transformer.inverse_transform(y_unscaled)
        return y_unscaled_unlog

And we can show it is equivalent to our earlier results:

two_steps = TwoTransformers()

print(np.all(y_log_scaled == two_steps.fit_transform(y)))
print(two_steps.fit_transform(y))

Output matches:

True
[[1.        ]
 [1.        ]
 [0.9979847 ]
 [0.98580284]
 ...
 [0.03378575]
 [0.        ]
 [0.        ]
 [0.01704215]
 [0.04478737]]
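
As an extra check (not in the original steps, but cheap to run): round-tripping through inverse_transform should recover the original y up to floating-point error:

# Round trip: log1p + scale, then unscale + expm1, should give back y.
recovered = two_steps.inverse_transform(two_steps.fit_transform(y))
print(np.allclose(recovered, y))  # expected: True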

Step 3: Integrating with TransformedTargetRegressor

Let's demo with LinearRegression (ignoring cross-validation for now) to make sure things are working:

from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

two_steps = TwoTransformers()

regr_trans = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=two_steps.fit_transform,
    inverse_func=two_steps.inverse_transform,
)
regr_trans.fit(X, y)
y_pred_two_step = regr_trans.predict(X)

For comparison, here is an equivalent version where we use our y_log_scaled variable from earlier, fit our model, then manually undo our operations:

clf = LinearRegression()
clf.fit(X, y_log_scaled)

# Undo the scaling, then the log transform
y_pred = clf.predict(X)
y_pred_unscale = scaler.inverse_transform(y_pred)
y_pred_unscale_unlog = log_transformer.inverse_transform(y_pred_unscale)

Again, we can show that both approaches get the same result:

print(np.all(y_pred_two_step == y_pred_unscale_unlog))
print(np.c_[y_pred_two_step, y_pred_unscale_unlog])

Output:

True
[[ 0.9212786   0.9212786 ]
 [ 0.9212786   0.9212786 ]
 [ 0.81566306  0.81566306]
 ...
 [-0.36027418 -0.36027418]
 [-0.41229248 -0.41229248]
 [-0.45701964 -0.45701964]]
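
This is also what resolves the cross-validation concern from the question: every call to fit re-runs two_steps.fit_transform, so the scaler's min and max are re-learned from whatever target it is fit on, i.e. from each training fold. A quick sketch to confirm the state is refreshed on every fit (the printed bounds are those of the log-transformed target):

two_steps.fit_transform(y[:10])  # fit on a subset of the target
print(two_steps.scaler.data_min_, two_steps.scaler.data_max_)
two_steps.fit_transform(y)       # re-fit on the full target
print(two_steps.scaler.data_min_, two_steps.scaler.data_max_)  # bounds change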

TL;DR: Final Code

Define a class with fit_transform and inverse_transform methods, and pass an instance's bound methods to TransformedTargetRegressor:

import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

X = np.array([-0.916,-0.916,-0.836,-0.768,-0.700,-0.608,-0.528,-0.472,-0.404,-0.300,-0.184,-0.0840,0.0480,0.168,0.328,0.468,0.640,0.760,0.872]).reshape(-1, 1)
y = np.array([0.899,0.899,0.895,0.871,0.827,0.747,0.607,0.479,0.339,0.167,0.00294,-0.0971,-0.181,-0.233,-0.309,-0.333,-0.333,-0.321,-0.301]).reshape(-1, 1)

class TwoTransformers(TransformerMixin, BaseEstimator):

    def fit_transform(self, y):
        # Fit both steps on y: log1p first, then min-max scale the result.
        self.log_transformer = FunctionTransformer(
            func=np.log1p,
            inverse_func=np.expm1,
        )
        self.scaler = MinMaxScaler()

        y_log = self.log_transformer.fit_transform(y)
        y_log_scaled = self.scaler.fit_transform(y_log)
        return y_log_scaled

    def inverse_transform(self, y):
        # Undo the steps in reverse order: unscale, then expm1.
        y_unscaled = self.scaler.inverse_transform(y)
        y_unscaled_unlog = self.log_transformer.inverse_transform(y_unscaled)
        return y_unscaled_unlog

two_steps = TwoTransformers()

regr_trans = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=two_steps.fit_transform,
    inverse_func=two_steps.inverse_transform,
)
regr_trans.fit(X, y)
y_pred_two_step = regr_trans.predict(X)

print(y_pred_two_step)
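
And to wire this back into the question's own setup, here is a sketch under the assumption that pipeline, tscv, X_train, y_train, and my_scorer are defined exactly as in the question:

from sklearn.model_selection import cross_val_score

two_steps = TwoTransformers()
model = TransformedTargetRegressor(
    regressor=pipeline,   # the ColumnTransformer + LGBMRegressor pipeline from the question
    func=two_steps.fit_transform,
    inverse_func=two_steps.inverse_transform,
    check_inverse=False,  # fit_transform re-fits the scaler, so skip the automatic inverse check
)
score = -cross_val_score(model, X_train, y_train, cv=tscv, scoring=my_scorer).mean()

Because fit_transform is re-fit on every call, each fold learns its min and max from that fold's training target only, which is the behavior the question asks for.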