customized transformerMixin with data labels in sklearn

Question

I'm working on a small project in which I'm trying to apply SMOTE (Synthetic Minority Over-sampling Technique) because my data is imbalanced.

I created a custom TransformerMixin class that wraps SMOTE:

class smote(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(X.shape, ' ', type(X)) # (57, 28)   <class 'numpy.ndarray'>
        print(len(y), ' ', type(y))  #    57      <class 'list'>
        smote = SMOTE(kind='regular', n_jobs=-1)
        X, y = smote.fit_sample(X, y)

        return X

    def transform(self, X):
        return X

model = Pipeline([
    ('posFeat1', featureVECTOR()),
    ('sca1', StandardScaler()),
    ('smote', smote()),
    ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state=38, tol=None))
])
model.fit(train_df, train_df['label'].values.tolist())
predicted = model.predict(test_df)

I put SMOTE inside fit() because I don't want it to be applied to the test data.

Unfortunately, I got this error:

     model.fit(train_df, train_df['label'].values.tolist())
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 213, in _fit
    **fit_params_steps[name])
  File "C:\Python35\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Python35\lib\site-packages\sklearn\base.py", line 520, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
AttributeError: 'numpy.ndarray' object has no attribute 'transform'

score 13 · Accepted Answer · edited Jul 02 '19 at 12:29

The fit() method should return self, not the transformed values. If you need the resampling to happen only on the training data and not on the test data, implement the fit_transform() method.

class smote(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(X.shape, ' ', type(X)) # (57, 28)   <class 'numpy.ndarray'>
        print(len(y), ' ', type(y))  #    57      <class 'list'>
        self.smote = SMOTE(kind='regular', n_jobs=-1).fit(X, y)

        return self

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.smote.sample(X, y)

    def transform(self, X):
        return X

Explanation: On the training data (i.e. when pipeline.fit() is called), Pipeline first tries to call fit_transform() on each intermediate step. If a step does not define it, Pipeline calls fit() and transform() separately.

On the test data, only transform() is called on each intermediate step, so the test data you supply is left unchanged.
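The calling order described above can be checked with a small sketch. `CallRecorder` is a hypothetical pass-through step written only for this illustration, and the data is random; it is not part of the asker's code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class CallRecorder(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through step that logs which methods Pipeline calls."""

    def _log(self, name):
        self.calls_ = getattr(self, "calls_", []) + [name]

    def fit(self, X, y=None):
        self._log("fit")
        return self  # fit() returns the estimator, never the data

    def fit_transform(self, X, y=None):
        self._log("fit_transform")
        return X  # training-only work would go here

    def transform(self, X):
        self._log("transform")
        return X  # test data passes through untouched


# Random illustrative data (an assumption, not the asker's dataset).
rng = np.random.RandomState(0)
X_train = rng.normal(size=(20, 3))
y_train = np.array([0, 1] * 10)
X_test = rng.normal(size=(5, 3))

pipe = Pipeline([("recorder", CallRecorder()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)   # training path
pipe.predict(X_test)         # test path

print(pipe.named_steps["recorder"].calls_)
```

The log should show `fit_transform` for the call to `pipe.fit()` and `transform` for the call to `pipe.predict()`; plain `fit()` is never called on the step because it defines `fit_transform()`.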

Update: The above code will still throw an error. When you oversample the supplied data, the number of samples in both X and y changes, but a scikit-learn Pipeline only passes the transformed X from step to step; it never changes y. So even if I fix the error above, you will get an error about the number of samples not matching the number of labels. And if the resampled data happened to contain the same number of samples as before, the y values would still not correspond to the new samples.
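A minimal sketch of that mismatch, assuming a hypothetical `DuplicateRows` step that stands in for an over-sampler by returning every row twice. It is not SMOTE; it only reproduces the "X grew, y did not" situation in a scikit-learn Pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class DuplicateRows(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an over-sampler: returns every row twice."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X doubles in length, but the pipeline has no way to update y.
        return np.vstack([X, X])


rng = np.random.RandomState(0)
X = rng.normal(size=(10, 2))
y = np.array([0, 1] * 5)

pipe = Pipeline([("dup", DuplicateRows()), ("clf", LogisticRegression())])

try:
    pipe.fit(X, y)  # the classifier receives 20 rows of X and 10 labels
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The classifier at the end of the pipeline rejects the input because the row count of X no longer matches the length of y.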

Working solution: Silly me.

You can just use the Pipeline from the imblearn package in place of the scikit-learn Pipeline. It automatically re-samples when fit() is called on the pipeline, and does not re-sample the test data (when transform() or predict() is called).

Actually, I knew that imblearn's Pipeline handles the sample() method, but I was thrown off when you implemented a custom class and said that the test data must not change. It did not occur to me that that's the default behaviour.

Just replace

from sklearn.pipeline import Pipeline

with

from imblearn.pipeline import Pipeline

and you are all set. No need to make a custom class as you did. Just use original SMOTE. Something like:

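from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

random_state = 38
model = Pipeline([
    ('posFeat1', featureVECTOR()),  # the asker's own feature-extraction step
    ('sca1', StandardScaler()),

    # Original SMOTE class
    ('smote', SMOTE(random_state=random_state)),
    ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state=random_state, tol=None))
])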

What do you mean by .sample(X, y)? Also, why didn't you implement SMOTE inside fit_transform rather than calling it? — Minions, Apr 11 '18 at 10:45
@Minion When you do `fit_sample()` you are joining two functions together, `fit()` and `sample()`. I have moved the `fit()` part to `fit()` and calling only `sample()` in `fit_transform()`. You can copy the whole `fit()` in there if you want. I did that just for code clarity. — Vivek Kumar, Apr 11 '18 at 10:55
Your code produces an error: File "C:\Python35\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array array = np.array(array, dtype=dtype, order=order, copy=copy) ValueError: could not broadcast input array from shape (81,28) into shape (81) — Minions, Apr 11 '18 at 12:33
@Minion Please provide your complete code and some data samples which produce the error. Edit the question to add the details. — Vivek Kumar, Apr 11 '18 at 12:35
It's hard to do that, the code is very large and interdependent, but I think the error is in this line: return self.smote.sample(X,y). Are you sure about it? — Minions, Apr 11 '18 at 12:39
@Minion Yes, I got the error. I will edit the answer to add the explanation. — Vivek Kumar, Apr 11 '18 at 12:41
What is the difference between fit_transform and fit_resample, and which one should I use? — Kirushikesh, Nov 14 '21 at 17:07
@Kirushikesh In smote, fit_transform has no effect. It is simply there for compatibility with scikit-learn pipeline. Tell me more about what you want to do. — Vivek Kumar, Nov 16 '21 at 02:01
