I am trying to use ColumnTransformer to do a full transformation of the Titanic dataset from Kaggle.
I have this custom transformer:
from sklearn.base import BaseEstimator, TransformerMixin

class TransformData(BaseEstimator, TransformerMixin):
    def __init__(self, change=True):
        self.change = change

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        def guess_age(x):
            return X.groupby("title").Age.mean()[x].round()

        def get_title(x):
            return x.split(',')[1].split('.')[0].strip()

        X['title'] = X.Name.apply(get_title)
        X['guessed_age'] = X.title.apply(guess_age)
        X.Age.fillna(X['guessed_age'], inplace=True)
        X.loc[X.Cabin.notnull(), 'Cabin'] = 1
        X.loc[X.Cabin.isna(), 'Cabin'] = 0
        X.drop(['guessed_age', 'PassengerId', 'Ticket', 'Name', 'title'], axis=1, inplace=True)
        return X
This fills in missing ages with the mean age for the passenger's title, turns Cabin into a 0/1 flag, and drops the columns I don't want.
However, when I try to combine this with standardization of the numerical attributes and one-hot encoding of the categorical ones:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_attribs = ['Age', 'SibSp', 'Parch', 'Fare']
cat_attribs = ["Embarked", "Pclass", "Sex"]

full_pipeline = ColumnTransformer(transformers=[
    ("transformation", TransformData(), train_df.drop('Survived', axis=1).columns),
    ("num", StandardScaler(), num_attribs),
    ("cat", OneHotEncoder(sparse=False), cat_attribs),
])
The final transformation gives me back a dataset that contains both the old columns and the newly transformed ones.
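A minimal sketch that reproduces the same effect on a made-up two-column frame rather than the Titanic data. The column names `a` and `b` are hypothetical, and an identity `FunctionTransformer` stands in for TransformData. As far as I understand it, each entry of a ColumnTransformer receives its columns from the original input and the results are concatenated side by side, which would explain why the old columns appear next to the transformed ones:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy frame (hypothetical data, not the Titanic set).
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

ct = ColumnTransformer(transformers=[
    # Stand-in for TransformData: passes both columns through unchanged.
    ("identity", FunctionTransformer(), ["a", "b"]),
    # Scales column "a" taken from the ORIGINAL input, not from the entry above.
    ("num", StandardScaler(), ["a"]),
])

out = ct.fit_transform(df)
print(out.shape)  # (3, 3): two untouched columns plus one scaled copy of "a"
```

The unscaled `a` and the scaled `a` both end up in the output, which mirrors what I see with the full dataset.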
The only thing that worked was calling fit_transform with TransformData first, and then passing the transformed data into a pipeline that applies the aforementioned operations to the numerical and categorical columns.
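For reference, that two-stage workaround could probably be written as a single object with sklearn's `Pipeline`, which runs its steps one after another. This is only a sketch under assumptions: `CabinFlag` is a hypothetical, trimmed-down stand-in for TransformData, the rows are made up, and `remainder="passthrough"` and `sparse_threshold=0` are my own choices, not part of the setup above:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class CabinFlag(BaseEstimator, TransformerMixin):
    """Hypothetical, trimmed-down stand-in for TransformData."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()  # work on a copy so the caller's frame is not mutated
        X["Cabin"] = X["Cabin"].notnull().astype(int)
        return X.drop(columns=["Name"])


# Toy rows (made-up values, only the column names resemble the Titanic set).
toy = pd.DataFrame({
    "Name": ["A, Mr. X", "B, Mrs. Y", "C, Miss. Z"],
    "Cabin": ["C85", None, None],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
})

encode = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age", "Fare"]),
        ("cat", OneHotEncoder(), ["Sex"]),
    ],
    remainder="passthrough",  # assumption: keep Cabin, which neither entry claims
    sparse_threshold=0,       # assumption: always return a dense array
)

# Stage 1 runs on the raw frame; stage 2 only ever sees stage 1's output.
two_stage = Pipeline([("custom", CabinFlag()), ("encode", encode)])
result = two_stage.fit_transform(toy)
print(result.shape)  # (3, 5): 2 scaled + 2 one-hot + 1 passthrough Cabin flag
```

Here the dropped `Name` column never reaches the second stage, so it does not show up in the result.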
What am I doing wrong? How can I resolve this?