
I am trying my hand at the Titanic dataset.

I would like to use the LabelBinarizer on a few columns and I would like to avoid using a for loop.

I am trying to use a lambda function but it doesn't work:

from sklearn.preprocessing import LabelBinarizer 

pp = LabelBinarizer()

X = df['sex', 'embarked', 'alive']
df.apply(lambda X: pp.fit_transform())

And:

df[['sex','embarked','alive']]= df[['sex','embarked','alive']].apply(lambda x: pp.fit_transform(x))

Could someone point me in the right direction please?

Ma0
choubix
  • Note that `df.apply` is syntactic sugar for a Python `for` loop. There is essentially no performance difference. – unutbu Dec 04 '17 at 15:54
  • In the future, you should provide the error messages when something "doesn't work"; otherwise, your question is likely to be closed. – Arya McCarthy Dec 07 '17 at 03:06

1 Answer


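I think the issue is that you are assigning to three columns at once on the left-hand side, while `LabelBinarizer.fit_transform` returns a 2-D array for each column (a single column for a two-class feature, one column per class otherwise), so `apply` cannot map the results back onto the original columns.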
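To see what comes back for a single column, it may help to check the output shapes directly. This is only a minimal sketch: the two toy arrays are made up and stand in for the `sex` and `embarked` columns of the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy columns standing in for the Titanic data
sex = np.array(["male", "female", "female", "male"])
embarked = np.array(["S", "C", "Q", "S"])

# A two-class column is binarized into a single 0/1 column
binary_out = LabelBinarizer().fit_transform(sex)

# A column with three classes is binarized into three 0/1 columns
multi_out = LabelBinarizer().fit_transform(embarked)

print(binary_out.shape)  # (4, 1)
print(multi_out.shape)   # (4, 3)
```

In both cases the result is two-dimensional, which is why it does not drop straight back into the frame column by column inside `apply`.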

Alternative

But as @unutbu said, there is essentially no performance difference between `df.apply` and a `for` loop, so I'd just use this:

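for col in ['sex', 'embarked', 'alive']:
    # fit_transform returns a 2-D array: shape (n, 1) for a two-class column,
    # (n, n_classes) for a column with more than two classes
    df[col] = pp.fit_transform(df[col])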
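If some of the columns have more than two classes (as `embarked` probably does), a variant of the loop that writes the result into new columns might look like the sketch below. The toy DataFrame, the `parts` list and the `column_class` naming scheme are my own assumptions, not something from the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy frame standing in for the Titanic data
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
    "age": [22, 38, 26, 35],
})

cols = ["sex", "embarked"]
parts = [df.drop(columns=cols)]  # keep the columns we are not encoding

for col in cols:
    lb = LabelBinarizer()
    values = lb.fit_transform(df[col])
    if values.shape[1] == 1:
        # Two-class column: the single output column marks lb.classes_[1]
        names = [f"{col}_{lb.classes_[1]}"]
    else:
        # Multi-class column: one output column per class
        names = [f"{col}_{cls}" for cls in lb.classes_]
    parts.append(pd.DataFrame(values, columns=names, index=df.index))

encoded = pd.concat(parts, axis=1)
print(encoded.columns.tolist())
```

Building a separate frame per column and concatenating at the end avoids trying to squeeze a multi-column result into a single existing column.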

But if you really want a one-liner, here's how to do it (warning: massive overkill):

Note: the fit, transform and fit_transform methods must be indented to the same level as the def __init__ method.

from sklearn.preprocessing import LabelBinarizer

class MultiColumnLabelBinarizer:
    def __init__(self, columns=None):
        self.columns = columns  # list of column names to encode

    def fit(self, X, y=None):
        return self  # not relevant here

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelBinarizer(). If no columns are specified, transforms all
        columns in X. Note that a column with more than two classes
        binarizes to several columns, which will not fit back into
        a single column.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelBinarizer().fit_transform(output[col])
        else:
            # DataFrame.iteritems() was removed in pandas 2.0; items() replaces it
            for colname, col in output.items():
                output[colname] = LabelBinarizer().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

df = MultiColumnLabelBinarizer(columns=['embarked', 'alive']).fit_transform(df)

Source: Label encoding across multiple columns in scikit-learn
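For completeness, if the goal is simply 0/1 indicator columns for several categorical columns in one call, `pandas.get_dummies` may be a lighter option than a custom class. This swaps in a different tool from `LabelBinarizer`, and the toy DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
    "age": [22, 38, 26, 35],
})

# One indicator column per category; untouched columns are kept as they are
encoded = pd.get_dummies(df, columns=["sex", "embarked"], dtype=int)
print(encoded.columns.tolist())
```

Unlike `LabelBinarizer`, this produces one column per class even for two-class features, so `sex` becomes `sex_female` and `sex_male`.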

Arya McCarthy
plumbus_bouquet