I have checked other questions covering the topic, such as this, this, this, this and this, as well as some great blog posts (blog1, blog2 and blog3; kudos to the respective authors), but without success.
What I want to do is to transform rows whose values are under a certain threshold in X, but only those that correspond to some specific classes in the target y (y != 9). The threshold is calculated based on the other class (y == 9). However, I have problems understanding how to implement this properly.
As I want to do parameter tuning and cross-validation on this, I will have to do the transformation using a pipeline. My custom transformer class is shown below. Note that I haven't included TransformerMixin, as I believe I need to account for y in the fit_transform() function.
class CustomTransformer(BaseEstimator):

    def __init__(self, percentile=.90):
        self.percentile = percentile

    def fit(self, X, y):
        # Calculate thresholds for each column
        thresholds = X.loc[y == 9, :].quantile(q=self.percentile, interpolation='linear').to_dict()
        # Store them for later use
        self.thresholds = thresholds
        return self

    def transform(self, X, y):
        # Create a copy of X
        X_ = X.copy(deep=True)
        # Replace values lower than the threshold for each column
        for p in self.thresholds:
            X_.loc[y != 9, p] = X_.loc[y != 9, p].apply(lambda x: 0 if x < self.thresholds[p] else x)
        return X_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)
This is then fed into a pipeline and subsequent GridSearchCV. I provide a working example below.
imports...
# Create some example data to work with
random.seed(12)
target = [randint(1, 8) for _ in range(60)] + [9]*40
shuffle(target)
example = pd.DataFrame({'feat1': sample(range(50, 200), 100),
                        'feat2': sample(range(10, 160), 100),
                        'target': target})
example_x = example[['feat1', 'feat2']]
example_y = example['target']
# Create a final nested pipeline where the data pre-processing steps and the final estimator are included
pipeline = Pipeline(steps=[('CustomTransformer', CustomTransformer(percentile=.90)),
                           ('estimator', RandomForestClassifier())])
# Parameter tuning with GridSearchCV
p_grid = {'estimator__n_estimators': [50, 100, 200]}
gs = GridSearchCV(pipeline, p_grid, cv=10, n_jobs=-1, verbose=3)
gs.fit(example_x, example_y)
The code above gives me the following error.
/opt/anaconda3/envs/Python37/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
TypeError: transform() missing 1 required positional argument: 'y'
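As far as I can tell, the same error can be reproduced without GridSearchCV. The sketch below is an assumption-laden reduction, not my real code: NeedsY is a hypothetical stand-in for CustomTransformer that does nothing except require y in transform(), and the data is made up. Fitting seems to work because the pipeline goes through fit_transform(X, y), while predicting calls transform(X) on the intermediate step with X only.

```python
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class NeedsY(BaseEstimator):
    """Hypothetical stand-in for CustomTransformer: transform() requires y."""

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)


# Made-up toy data
X = pd.DataFrame({"feat1": [1, 2, 3, 4], "feat2": [4, 3, 2, 1]})
y = pd.Series([1, 9, 1, 9])

pipe = Pipeline(steps=[
    ("needs_y", NeedsY()),
    ("estimator", RandomForestClassifier(n_estimators=5, random_state=0)),
])

# Fitting goes through fit_transform(X, y), so y is available here
pipe.fit(X, y)

# Predicting calls transform(X) on the intermediate step, without y
try:
    pipe.predict(X)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If this reading is right, the failure in the grid search would come from the scoring step on the held-out fold, where no target is passed to transform().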
I have also tried other approaches, such as storing the corresponding class indices during fit() and then using those during transform(). However, as the train and test indices during cross-validation are not the same, this gives an index error when values are replaced in transform().
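A reduced sketch of that index-storing attempt, again on made-up data, shows the kind of failure I mean. The manual split is a hypothetical stand-in for one cross-validation fold, and with label-based .loc indexing the lookup error surfaces as a KeyError in this sketch.

```python
import pandas as pd

# Made-up toy data
X = pd.DataFrame({"feat1": [10, 20, 30, 5, 25, 40]})
y = pd.Series([9, 9, 9, 1, 2, 3])

# Hypothetical split standing in for one cross-validation fold
X_train, y_train = X.iloc[:4], y.iloc[:4]
X_test = X.iloc[4:]

# "fit": remember which training rows do not belong to class 9
stored_index = y_train[y_train != 9].index

# "transform" on the held-out fold: the stored labels are not in its index
try:
    X_test.loc[stored_index, "feat1"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)
```

The stored labels describe rows of the training fold only, so they cannot say anything about which rows of the test fold belong to which class.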
So, is there a clever way to solve this?