Transformations with ColumnTransformer

Question

I am trying to use ColumnTransformer to do a full transformation of the Titanic dataset from Kaggle.

I have this custom transformer:

from sklearn.base import BaseEstimator, TransformerMixin

class TransformData(BaseEstimator,TransformerMixin):
    def __init__(self, change = True):
        self.change = change
        
    def fit(self,X,y=None):
        return self
    
    def transform(self, X, y=None):
        
        def guess_age(x):
            return X.groupby("title").Age.mean()[x].round()
        def get_title(x):
            return x.split(',')[1].split('.')[0].strip()
        
        X['title'] = X.Name.apply(get_title)
        X['guessed_age'] = X.title.apply(guess_age)
        X.Age.fillna(X['guessed_age'],inplace=True)
        X.loc[X.Cabin.notnull(),'Cabin'] = 1
        X.loc[X.Cabin.isna(),'Cabin'] = 0
        X.drop(['guessed_age','PassengerId','Ticket','Name','title'],axis=1,inplace = True)
        
        return X

It adds some columns that I find useful and drops the ones I don't want.

However, when I try to combine this with standardization of the numerical attributes and one-hot encoding of the categorical ones:

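from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attribs = ['Age','SibSp','Parch','Fare']
cat_attribs = ["Embarked","Pclass","Sex"]

# Each transformer below receives the original input columns, not the output of the one before it.
full_pipeline = ColumnTransformer(transformers=[
    ("transformation", TransformData(),train_df.drop('Survived',axis=1).columns),
    ("num", StandardScaler(), num_attribs),
    # Note: scikit-learn >= 1.2 renames this argument to sparse_output=False.
    ("cat", OneHotEncoder(sparse=False), cat_attribs),
])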

The final transformation gives me back a dataset with the old columns and the new transformed ones:

[Image: Multiple Column results]
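
The result described above is consistent with how ColumnTransformer works: every transformer is given the original input columns, and the outputs are concatenated side by side rather than chained. A minimal sketch of that behaviour, assuming a made-up two-column frame and an identity FunctionTransformer standing in for the custom step:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [7.0, 8.0, 9.0]})

ct = ColumnTransformer(transformers=[
    # Passes both columns through unchanged, much like a cleaning-only step.
    ("identity", FunctionTransformer(), ["Age", "Fare"]),
    # Scales the same original columns, independently of the step above.
    ("num", StandardScaler(), ["Age", "Fare"]),
])

out = ct.fit_transform(toy)
print(out.shape)  # (3, 4): two untouched columns followed by two scaled ones
```

The unscaled and scaled versions of the same columns both appear in the output, which matches the "old columns plus new transformed ones" symptom.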

As a workaround, I ran fit_transform with TransformData first and then passed the transformed data into a pipeline that performs the numerical and categorical operations mentioned above.
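
That two-step workaround can probably be expressed as a single Pipeline, so the cleaning step runs first and the ColumnTransformer then operates on its output. The sketch below is only an illustration: CabinFlag and the toy frame are hypothetical stand-ins (not the original TransformData or the Kaggle data), and the remainder and sparse_threshold settings are choices made for this example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class CabinFlag(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TransformData: flags Cabin, drops Name."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's frame untouched
        X["Cabin"] = X["Cabin"].notna().astype(int)
        return X.drop(columns=["Name"])


# Made-up rows with a few Titanic-like columns.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [20.0, 30.0, 40.0],
    "Sex": ["male", "female", "male"],
    "Cabin": ["C85", None, None],
})

encode = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age"]),
        ("cat", OneHotEncoder(), ["Sex"]),
    ],
    remainder="passthrough",  # keeps the Cabin flag produced by the first step
    sparse_threshold=0,       # always return a dense array
)

# Steps run in order: "clean" first, then "encode" on the cleaned frame.
full_pipeline = Pipeline(steps=[("clean", CabinFlag()), ("encode", encode)])
out = full_pipeline.fit_transform(toy)
print(out.shape)  # (3, 4): scaled Age, two Sex indicators, Cabin flag
```

Because the steps are chained, the dropped Name column never reaches the output, and each remaining column appears only once.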

What am I doing wrong? How can I resolve this?

Hey, thanks for your answer, it does! – maolmedilla, Aug 25 '23 at 07:26

Musabbir Arrafi · Answer 1 · 2023-08-25T12:09:53.803

You're mostly right; you just forgot to assign the output of the drop() method. Here's your corrected code:

from sklearn.base import BaseEstimator, TransformerMixin

class TransformData(BaseEstimator,TransformerMixin):
    def __init__(self, change = True):
        self.change = change
        
    def fit(self,X,y=None):
        return self
    
    def transform(self, X, y=None):
        
        def guess_age(x):
            return X.groupby("title").Age.mean()[x].round()
        def get_title(x):
            return x.split(',')[1].split('.')[0].strip()
        
        X['title'] = X.Name.apply(get_title)
        X['guessed_age'] = X.title.apply(guess_age)
        X.Age.fillna(X['guessed_age'],inplace=True)
        X.loc[X.Cabin.notnull(),'Cabin'] = 1
        X.loc[X.Cabin.isna(),'Cabin'] = 0
        X = X.drop(['guessed_age','PassengerId',
                    'Ticket','Name','title'],
                   axis=1)
        
        return X

The Question has `inplace=True`, which modifies the dataframe directly; reassigning isn't necessary. (Although, you probably shouldn't use `inplace=True`: https://stackoverflow.com/q/45570984/10495893) — Ben Reiniger, Aug 24 '23 at 18:59
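
The difference the comment points to can be checked on a small made-up frame. This is only a sketch of how `drop` behaves with and without `inplace=True`; the column names are arbitrary:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# With inplace=True the frame is modified directly and the call returns None.
result = df.drop(["b"], axis=1, inplace=True)
print(result)            # None
print(list(df.columns))  # ['a']

# Without inplace, drop returns a new frame and leaves the original untouched.
df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
dropped = df2.drop(["b"], axis=1)
print(list(df2.columns))      # ['a', 'b']
print(list(dropped.columns))  # ['a']
```

Either form removes the column from the frame that is returned, so the reassignment alone does not change which columns the transformer emits.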
