Scikit learn preprocessing LabelBinarizer with lambda function

Question

I am trying my hand at the Titanic dataset.

I would like to use the LabelBinarizer on a few columns and I would like to avoid using a for loop.

I am trying to use a lambda function but it doesn't work:

from sklearn.preprocessing import LabelBinarizer 

pp = LabelBinarizer()

X = df['sex', 'embarked', 'alive']
df.apply(lambda X: pp.fit_transform())

And:

df[['sex','embarked','alive']]= df[['sex','embarked','alive']].apply(lambda x: pp.fit_transform(x))

Could someone point me in the right direction please?

Note that `df.apply` is syntactic sugar for a Python `for-loop`. There is essentially no performance difference. — unutbu, Dec 04 '17 at 15:54
In the future, you should provide the error messages when something "doesn't work"; otherwise, your question is likely to be closed. — Arya McCarthy, Dec 07 '17 at 03:06

score 0 · Accepted Answer · edited Dec 07 '17 at 03:07

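I think the issue is that LabelBinarizer works on one column at a time and returns a 2-D array (one indicator column per class when a column has more than two classes), so with three columns on the left-hand side the results cannot be assigned back cleanly.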
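To see what LabelBinarizer hands back for each column, here is a minimal sketch on a made-up frame. The `toy` data is an assumption for illustration only, not the real Titanic values.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy stand-in for the Titanic columns (values are made up for illustration).
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

# Two classes: a single 0/1 column comes back, shape (n_samples, 1).
sex_encoded = LabelBinarizer().fit_transform(toy["sex"])

# Three classes: one indicator column per class, shape (n_samples, 3).
embarked_encoded = LabelBinarizer().fit_transform(toy["embarked"])

print(sex_encoded.shape, embarked_encoded.shape)
```

So a two-class column such as sex yields one output column, while a three-class column such as embarked yields three, which no longer fits into a single DataFrame column.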

Alternative

But as @unutbu said, there is no difference in performance between df.apply and a for loop, so I'd just use this:

for col in ['sex','embarked','alive']:
    df[col] = pp.fit_transform(df[col])
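If one of the columns has more than two classes (embarked usually does), the binarizer returns several indicator columns and a plain assignment to df[col] will not fit. A possible variant of the loop is sketched below. The helper name `binarize_columns` and the toy data are hypothetical, and it assumes the columns contain no missing values.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

def binarize_columns(frame, columns):
    """Return a copy of frame with each listed column replaced by 0/1 columns."""
    result = frame.copy()
    for col in columns:
        lb = LabelBinarizer()
        encoded = lb.fit_transform(result[col])
        if encoded.shape[1] == 1:
            # Two classes (or one): flatten (n, 1) to 1-D and overwrite in place.
            result[col] = encoded.ravel()
        else:
            # More than two classes: one new column per class, "<col>_<class>".
            names = [f"{col}_{cls}" for cls in lb.classes_]
            dummies = pd.DataFrame(encoded, columns=names, index=result.index)
            result = pd.concat([result.drop(columns=col), dummies], axis=1)
    return result

# Made-up rows standing in for the Titanic data.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
    "alive": ["no", "yes", "yes", "no"],
})
encoded_toy = binarize_columns(toy, ["sex", "embarked", "alive"])
print(encoded_toy.columns.tolist())
```

Two-class columns keep their name, while multi-class columns are dropped and replaced by their indicator columns appended at the end of the frame.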

But if you really want a one-liner, here's how you do it (warning, massive overkill):

Note: the fit, transform and fit_transform methods must be indented to the same level as the __init__ method.

class MultiColumnLabelBinarizer:
    def __init__(self, columns=None):
        self.columns = columns  # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelBinarizer(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelBinarizer().fit_transform(output[col])
        else:
            for colname, col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelBinarizer().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

df = MultiColumnLabelBinarizer(columns = ['embarked','alive']).fit_transform(df)

Source: Label encoding across multiple columns in scikit-learn

Thank you for your thorough answer, much appreciated! — choubix, Dec 06 '17 at 13:53
