I am trying to use ColumnTransformer to do a full transformation of the Titanic dataset from Kaggle.

I have this custom transformer:

from sklearn.base import BaseEstimator, TransformerMixin

class TransformData(BaseEstimator,TransformerMixin):
    def __init__(self, change = True):
        self.change = change
        
    def fit(self,X,y=None):
        return self
    
    def transform(self, X, y=None):
        
        def guess_age(x):
            return X.groupby("title").Age.mean()[x].round()
        def get_title(x):
            return x.split(',')[1].split('.')[0].strip()
        
        X['title'] = X.Name.apply(get_title)
        X['guessed_age'] = X.title.apply(guess_age)
        X.Age.fillna(X['guessed_age'],inplace=True)
        X.loc[X.Cabin.notnull(),'Cabin'] = 1
        X.loc[X.Cabin.isna(),'Cabin'] = 0
        X.drop(['guessed_age','PassengerId','Ticket','Name','title'],axis=1,inplace = True)
        
        return X

It adds some helper columns that I find useful and drops the ones I don't want.

However, when I try to combine this with standardization of the numerical attributes and one-hot encoding of the categorical ones:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attribs = ['Age','SibSp','Parch','Fare']
cat_attribs = ["Embarked","Pclass","Sex"]

full_pipeline = ColumnTransformer(transformers=[
    ("transformation", TransformData(),train_df.drop('Survived',axis=1).columns),
    ("num", StandardScaler(), num_attribs),
    ("cat", OneHotEncoder(sparse=False), cat_attribs),
])

The final transformation gives me back a dataset with the old columns and the new transformed ones:

[Image: Multiple Column results]
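The duplicated columns described above can be reproduced on a small frame. This is a minimal sketch, not the original Titanic data: the `toy` DataFrame and the `"keep"` passthrough step are assumptions standing in for `train_df` and `TransformData`. It suggests that each entry in a ColumnTransformer receives the original input columns and that the outputs are concatenated side by side, rather than being applied one after another.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the Titanic training frame.
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [7.0, 8.0, 9.0]})

ct = ColumnTransformer(transformers=[
    # Stand-in for a transformer that returns the (cleaned) frame unchanged.
    ("keep", "passthrough", ["Age", "Fare"]),
    # Receives the original Age column, not the output of "keep".
    ("num", StandardScaler(), ["Age"]),
])

out = ct.fit_transform(toy)
print(out.shape)  # two passthrough columns plus one scaled column
```

The unscaled `Age` column and its scaled copy both appear in `out`, which matches the "old columns plus new columns" result in the question.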

What I was able to do was call fit_transform with TransformData first, and then feed the transformed data into a pipeline that performs the aforementioned operations on the numerical and categorical data.

What am I doing wrong? How can I resolve this?
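The two-step workaround just described can also be written as a single estimator by chaining the steps in a `Pipeline`. The following is a sketch under stated assumptions: `FillAgeAndDrop` is a hypothetical, simplified stand-in for `TransformData`, the `toy` frame is invented for illustration, and `sparse_threshold=0` is used only to force a dense result without relying on a version-specific `OneHotEncoder` argument.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FillAgeAndDrop(BaseEstimator, TransformerMixin):
    """Hypothetical, simplified stand-in for TransformData."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        # Title is the text between the comma and the first period of Name.
        title = X["Name"].str.split(",").str[1].str.split(".").str[0].str.strip()
        # Fill missing ages with the rounded mean age of the same title.
        X["Age"] = X["Age"].fillna(X.groupby(title)["Age"].transform("mean").round())
        return X.drop(columns=["PassengerId", "Name"])


# Invented rows that mimic the shape of the Titanic data.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John",
             "Allen, Mr. William", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 38.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 8.05, 7.92],
    "Sex": ["male", "female", "male", "female"],
})

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age", "Fare"]),
        ("cat", OneHotEncoder(), ["Sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# The cleaning step runs first; the ColumnTransformer then sees its output.
full_pipeline = Pipeline([("clean", FillAgeAndDrop()), ("preprocess", preprocess)])
result = full_pipeline.fit_transform(toy)
print(result.shape)
```

Here the ColumnTransformer only lists the scaled and encoded columns, so the dropped and raw columns do not reappear in the output.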

desertnaut

1 Answer

You're mostly right; you just forgot to assign the result of the `drop()` call back to `X`. Here's your corrected code:

from sklearn.base import BaseEstimator, TransformerMixin

class TransformData(BaseEstimator,TransformerMixin):
    def __init__(self, change = True):
        self.change = change
        
    def fit(self,X,y=None):
        return self
    
    def transform(self, X, y=None):
        
        def guess_age(x):
            return X.groupby("title").Age.mean()[x].round()
        def get_title(x):
            return x.split(',')[1].split('.')[0].strip()
        
        X['title'] = X.Name.apply(get_title)
        X['guessed_age'] = X.title.apply(guess_age)
        X.Age.fillna(X['guessed_age'],inplace=True)
        X.loc[X.Cabin.notnull(),'Cabin'] = 1
        X.loc[X.Cabin.isna(),'Cabin'] = 0
        X = X.drop(['guessed_age','PassengerId',
                    'Ticket','Name','title'],
                   axis=1)
        
        return X
Musabbir Arrafi
The question has `inplace=True`, which modifies the dataframe directly; reassigning isn't necessary. (That said, you probably shouldn't use `inplace=True`: https://stackoverflow.com/q/45570984/10495893) – Ben Reiniger Aug 24 '23 at 18:59
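The point in the comment above can be checked with a small sketch. The helper names `drop_inplace` and `drop_reassign` and the two-column frames are assumptions made up for illustration. Both variants return a frame without the dropped column; the difference is only whether the caller's object is modified.

```python
import pandas as pd


def drop_inplace(df):
    # Mutates the object the caller passed in.
    df.drop(["b"], axis=1, inplace=True)
    return df


def drop_reassign(df):
    # Rebinds the local name to a new frame; the caller's object is unchanged.
    df = df.drop(["b"], axis=1)
    return df


first = pd.DataFrame({"a": [1], "b": [2]})
second = pd.DataFrame({"a": [1], "b": [2]})

out_first = drop_inplace(first)
out_second = drop_reassign(second)

print(list(out_first.columns), list(out_second.columns))
print(list(first.columns), list(second.columns))
```

Since the returned frames have the same columns either way, switching to reassignment would not by itself explain the extra columns in the ColumnTransformer output.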