I have checked other questions covering the topic, such as this, this, this, this and this, as well as some great blog posts, blog1, blog2 and blog3 (kudos to the respective authors), but without success.

What I want to do is to zero out values under a certain threshold in X, but only in the rows that correspond to specific classes in the target y (y != 9). The threshold is calculated from the other class (y == 9). However, I am having trouble understanding how to implement this properly.
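
To make the intent concrete, here is a toy illustration of the transformation on a single column (the numbers are made up):

import pandas as pd

# Toy data: two rows of the reference class (9) and two rows of other classes
X = pd.DataFrame({'feat1': [100, 120, 110, 160]})
y = pd.Series([9, 9, 1, 2])

# The threshold is computed from the y == 9 rows only
threshold = X.loc[y == 9, 'feat1'].quantile(q=.90)   # 118.0 here

# Zero out sub-threshold values, but only in the y != 9 rows
mask = (y != 9) & (X['feat1'] < threshold)
X.loc[mask, 'feat1'] = 0
# X['feat1'] is now [100, 120, 0, 160]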

As I want to do parameter tuning and cross-validation on this, I will have to do the transformation inside a pipeline. My custom transformer class looks like below. Note that I haven't included TransformerMixin, as I believe I need to take y into account in the fit_transform() function.

from sklearn.base import BaseEstimator

class CustomTransformer(BaseEstimator):

    def __init__(self, percentile=.90):
        self.percentile = percentile

    def fit(self, X, y):
        # Calculate thresholds for each column
        thresholds = X.loc[y == 9, :].quantile(q=self.percentile, interpolation='linear').to_dict()

        # Store them for later use
        self.thresholds = thresholds
        return self

    def transform(self, X, y):
        # Create a copy of X
        X_ = X.copy(deep=True)

        # Replace values lower than the threshold for each column
        for p in self.thresholds:
            X_.loc[y != 9, p] = X_.loc[y != 9, p].apply(lambda x: 0 if x < self.thresholds[p] else x)
        return X_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)

This is then fed into a pipeline and a subsequent GridSearchCV. I provide a reproducible example below.

import random
from random import randint, shuffle, sample
import pandas as pd

from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Create some example data to work with
random.seed(12)
target = [randint(1, 8) for _ in range(60)] + [9]*40
shuffle(target)
example = pd.DataFrame({'feat1': sample(range(50, 200), 100),
                        'feat2': sample(range(10, 160), 100),
                        'target': target})
example_x = example[['feat1', 'feat2']]
example_y = example['target']

# Create a final nested pipeline where the data pre-processing steps and the final estimator are included
pipeline = Pipeline(steps=[('CustomTransformer', CustomTransformer(percentile=.90)),
                           ('estimator', RandomForestClassifier())])

# Parameter tuning with GridSearchCV
p_grid = {'estimator__n_estimators': [50, 100, 200]}
gs = GridSearchCV(pipeline, p_grid, cv=10, n_jobs=-1, verbose=3)
gs.fit(example_x, example_y)

The above code gives me the following error:

/opt/anaconda3/envs/Python37/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

TypeError: transform() missing 1 required positional argument: 'y'
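
As far as I can tell, this happens because the pipeline only passes y during fitting; when GridSearchCV scores a held-out fold it calls transform(X) without y, which my two-argument transform() cannot handle. A small probe I wrote to confirm this calling convention (the Probe class and the dummy setup below are mine, just for demonstration):

from sklearn.base import BaseEstimator
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline

class Probe(BaseEstimator):
    def fit(self, X, y=None):
        print('fit received y:', y is not None)
        return self

    def transform(self, X):
        print('transform received X only')
        return X

pipe = Pipeline([('probe', Probe()), ('clf', DummyClassifier())])
pipe.fit([[0], [1]], [0, 1])   # fit received y: True, then transform received X only
pipe.predict([[0], [1]])       # transform received X only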


I have also tried other approaches, such as storing the corresponding class indices during fit() and then using those during transform(). However, as the train and test indices during cross-validation are not the same, this gives an index error when values are replaced in transform().
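
For reference, roughly what that attempt looked like (a condensed, hypothetical sketch, not my exact code):

import pandas as pd

# Remember the row labels of the y != 9 rows at fit time ...
train_y = pd.Series([9, 1, 2], index=[0, 1, 2])
stored_idx = train_y[train_y != 9].index   # labels [1, 2] from the training fold

# ... and try to reuse them at transform time on a different fold
test_X = pd.DataFrame({'feat1': [10, 20]}, index=[3, 4])
test_X.loc[stored_idx]                     # KeyError: those labels don't exist in the test fold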

So, is there a clever way to solve this?

  • Hey @Jakob, this use case of yours seems slightly invalid. Think about how you will provide the targets (`y` here) when you deploy this. On real data you will not have the actual targets to provide. – Vivek Kumar May 23 '20 at 18:52
  • That's a very good point @VivekKumar. I won't be able to do this transformation with _any_ `y` when it's running in production simply because I won't know its class. Is that correctly understood? – Jakob May 23 '20 at 20:50
  • That's exactly what @VivekKumar means, and he is right; every feature transformation that takes into account the true labels is fundamentally invalid for this very reason. – desertnaut May 23 '20 at 22:07
  • Even when it is seemingly possible, mixing the labels in any stage of feature engineering or selection is [guaranteed to lead you astray](https://stackoverflow.com/questions/56308116/should-feature-selection-be-done-before-train-test-split-or-after/56548332#56548332). – desertnaut May 23 '20 at 22:19
  • You are both making very important points, thank you @VivekKumar and desertnaut. But let's say I'm confident that in production the data will come in the format I'm trying to describe above (for whatever reason; let's say my current dataset is not completely representative of what I will experience later). In order to evaluate my model prior to production I want to make the data as representative as possible (and thus transform it), and I want to do so for each train/test set pair during cross-validation as described above. Is there a way to accomplish this? – Jakob May 24 '20 at 07:32
  • No, it's not possible in scikit-learn's `GridSearchCV` or any cross-validator in it. The reason is that if you already have the `y`, then you don't need cross-validation. The main reason for using a cross-validator is to get the predictions of the model and then compare them to the actual targets. So `GridSearchCV` will not pass `y` to the model, only `X`, and then get the `y_pred` from the model to compare it with `y`. – Vivek Kumar May 24 '20 at 12:23
  • If you really want to do it, then you need to make `y` a part of `X`. Maybe append `y` to `X` as the last column, use it in your custom transformer, and then pass the remaining `X` (without the last column) up to the next part of the pipeline. But really understand what you are doing. – Vivek Kumar May 24 '20 at 12:24
  • @Jakob So do you want a solution like what I described in the comments above? – Vivek Kumar May 25 '20 at 05:49
  • @VivekKumar Thank you. I tried to implement it as you described above (unsuccessfully so far). I then reached out to the authors of the paper I'm implementing, and it turned out that I had misinterpreted their approach. They hadn't replaced values for _one_ class only, but for _all_ classes, which makes my original question wrong from that perspective. However, there might still be value for the community if such an implementation can be displayed. I will accept such an answer if posted. – Jakob May 25 '20 at 06:42
  • @Jakob I have posted an answer. – Vivek Kumar May 26 '20 at 06:01

1 Answer

In the comments I was talking about this:

class CustomTransformer(BaseEstimator):

    def __init__(self, percentile=.90):
        self.percentile = percentile

    def fit(self, X, y):
        # Calculate thresholds for each column

        # We have appended y as the last column in X, so remove it
        X_ = X.iloc[:, :-1].copy(deep=True)

        thresholds = X_.loc[y == 9, :].quantile(q=self.percentile, interpolation='linear').to_dict()

        # Store them for later use
        self.thresholds = thresholds
        return self

    def transform(self, X):
        # Work on a copy of the actual features, excluding the appended target

        # We have appended y as the last column in X, so remove it
        X_ = X.iloc[:, :-1].copy(deep=True)

        # Recover y from the appended last column
        y = X.iloc[:, -1].copy(deep=True)

        # Replace values lower than the threshold for each column
        for p in self.thresholds:
            X_.loc[y != 9, p] = X_.loc[y != 9, p].apply(lambda x: 0 if x < self.thresholds[p] else x)
        return X_

    def fit_transform(self, X, y):
        return self.fit(X, y).transform(X)

And then change your X, y:

# We are appending the target into X
example_x = example[['feat1', 'feat2', 'target']]
example_y = example['target']
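
With the appended column, the original pipeline and grid search should run as-is; the estimator never sees the target column, because transform() strips it before passing the data on:

# Unchanged from the question
pipeline = Pipeline(steps=[('CustomTransformer', CustomTransformer(percentile=.90)),
                           ('estimator', RandomForestClassifier())])

p_grid = {'estimator__n_estimators': [50, 100, 200]}
gs = GridSearchCV(pipeline, p_grid, cv=10, n_jobs=-1, verbose=3)
gs.fit(example_x, example_y)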
  • Oh okay, so you appended y to X _before_ feeding it to fit(). That's a better approach. Works like a charm. Thank you @VivekKumar. – Jakob May 26 '20 at 08:18
  • @Jakob But note that this is a hypothetical setup where you are already using `y` for the test/validation data, instead of just comparing against it. – Vivek Kumar May 28 '20 at 16:04