I think the issue is that you are passing three columns on the left-hand side at once, and sklearn gets confused.
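To illustrate what I mean, here is a minimal sketch. The toy DataFrame is made up, and I'm assuming the encoder is a `LabelEncoder` (the question's `pp` may be something else). It expects a single 1-D column, so handing it a three-column frame raises a `ValueError`, while one column at a time works:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "S"],
    "alive": ["no", "yes", "yes"],
})
cols = ["sex", "embarked", "alive"]

# Passing all three columns at once: the encoder wants 1-D input
try:
    LabelEncoder().fit_transform(df[cols])
    failed = False
except ValueError as exc:
    failed = True
    print("sklearn complained:", exc)

# Passing a single column works
one_col = LabelEncoder().fit_transform(df["sex"])
print(one_col)
```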
Alternative
But as @unutbu said, there is no difference in performance between `df.apply` and a `for` loop, so I'd just use this:

    for col in ['sex', 'embarked', 'alive']:
        df[col] = pp.fit_transform(df[col])

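For reference, a self-contained sketch of the loop next to the `df.apply` version, so you can check they give the same result. The sample data is invented, and `pp` is assumed to be a `LabelEncoder` here; swap in whatever encoder you actually use:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "S"],
    "alive": ["no", "yes", "yes"],
    "age": [22, 38, 26],
})
cols = ["sex", "embarked", "alive"]

# df.apply version: the encoder is refit on each column in turn
encoded_apply = df[cols].apply(LabelEncoder().fit_transform)

# for-loop version: same thing, one column at a time
pp = LabelEncoder()  # assumption: pp is a LabelEncoder
encoded_loop = df.copy()
for col in cols:
    encoded_loop[col] = pp.fit_transform(encoded_loop[col])

print(encoded_loop)
```

Both versions refit the encoder per column, so only the mapping for the last column stays in `pp` afterwards. If you need to invert the encoding later, keep one encoder per column.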
But if you really want a one-liner, here's how you do it (warning: massive overkill):

    from sklearn.preprocessing import LabelBinarizer

    class MultiColumnLabelBinarizer:
        def __init__(self, columns=None):
            self.columns = columns  # array of column names to encode

        def fit(self, X, y=None):
            return self  # not relevant here

        def transform(self, X):
            '''
            Transforms columns of X specified in self.columns using
            LabelBinarizer(). If no columns specified, transforms all
            columns in X.
            '''
            # Note: LabelBinarizer returns one column per class when a column
            # has more than two classes, so assigning the result back to a
            # single column only suits two-class columns.
            output = X.copy()
            if self.columns is not None:
                for col in self.columns:
                    output[col] = LabelBinarizer().fit_transform(output[col])
            else:
                # DataFrame.iteritems() was removed in pandas 2.0; use items()
                for colname, col in output.items():
                    output[colname] = LabelBinarizer().fit_transform(col)
            return output

        def fit_transform(self, X, y=None):
            return self.fit(X, y).transform(X)

    df = MultiColumnLabelBinarizer(columns=['embarked', 'alive']).fit_transform(df)

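One thing worth checking before you rely on this: `LabelBinarizer` gives a single 0/1 column for a two-class input but one indicator column per class when there are more than two classes. A small sketch with made-up values (the `embarked_` prefix is just an illustrative naming choice, not part of the original code):

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Two classes: a single 0/1 column comes back
binary = LabelBinarizer().fit_transform(["no", "yes", "yes"])
print(binary.shape)

# Three classes: one indicator column per class comes back
lb = LabelBinarizer()
multi = lb.fit_transform(["S", "C", "Q", "S"])
print(multi.shape)

# One possible way to keep the multi-class result: expand it into
# separate columns named after the classes
expanded = pd.DataFrame(multi, columns=["embarked_" + c for c in lb.classes_])
print(expanded)
```

So for a column like `embarked`, if it has more than two distinct values, you'd probably want the expanded columns (or a `LabelEncoder`) rather than assigning the result back to one column.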
Source: Label encoding across multiple columns in scikit-learn