
So I have the following data:

>>> test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])
>>> test

0    [a, b, e]
1       [c, a]
2          [d]
3          [d]
4          [e]

I am trying to one-hot-encode all of the values in the lists back into my dataframe, so that the result looks like this:

>>> pd.DataFrame([[1, 1, 0, 0, 1], [1, 0, 1, 0, 0],
              [0, 0, 0, 1, 0], [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]],
             columns = ['a', 'b', 'c', 'd', 'e'])

    a   b   c   d   e
0   1   1   0   0   1
1   1   0   1   0   0
2   0   0   0   1   0
3   0   0   0   1   0
4   0   0   0   0   1

I have tried researching and I've found similar problems but none like this. I have attempted:

test.apply(pd.Series)

But that doesn't accomplish the one-hot aspect; it simply unpacks each list into positional columns, padding the shorter lists with NaN. I'm sure I could figure out a lengthy solution, but I'd be glad to hear if there's a more elegant way to do this.

Thanks!

EDIT: I am aware that I can iterate through my test series, then create a column for each unique value found, then go back and iterate through test again, flagging said columns for unique values. But that doesn't seem very pandorable to me and I'm sure there's a more elegant way to do this.
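
For reference, the two-pass approach described above could be sketched roughly like this. The names `labels` and `manual` are placeholders chosen for illustration, and this is only a minimal sketch of the loop-based idea, not a recommended solution:

```python
import pandas as pd

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# Pass 1: collect every unique label found in any of the lists.
labels = sorted({label for row in test for label in row})

# Pass 2: for each row, flag which of those labels it contains.
manual = pd.DataFrame(
    [[int(label in row) for label in labels] for row in test],
    columns=labels,
    index=test.index,
)
print(manual)
```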

Brian
  • Scikit is about to get an upgraded `OneHotEncoder` that will encode strings, FWIW: https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62 – Dance Party2 Sep 05 '18 at 16:00

1 Answer


`MultiLabelBinarizer` from scikit-learn is more efficient for this kind of problem and should be preferred over `apply` with `pd.Series`. Here's a demo:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

mlb = MultiLabelBinarizer()

res = pd.DataFrame(mlb.fit_transform(test),
                   columns=mlb.classes_,
                   index=test.index)

Result

   a  b  c  d  e
0  1  1  0  0  1
1  1  0  1  0  0
2  0  0  0  1  0
3  0  0  0  1  0
4  0  0  0  0  1
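
Since the question mentions getting the encoded columns back into a dataframe, here is a minimal sketch of that step. The dataframe `df` and its column names `id` and `tags` are hypothetical and not taken from the question; the idea is to build the encoded frame on the same index and join it on.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical dataframe: the 'id' and 'tags' column names are assumptions.
df = pd.DataFrame({
    'id': [10, 11, 12, 13, 14],
    'tags': [['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']],
})

mlb = MultiLabelBinarizer()

# Encode the list column, reusing df's index so the rows line up.
encoded = pd.DataFrame(mlb.fit_transform(df['tags']),
                       columns=mlb.classes_,
                       index=df.index)

# Swap the original list column for the one-hot columns.
out = df.drop(columns='tags').join(encoded)
print(out)
```

Reusing `df.index` matters here: `join` aligns on the index, so the encoded rows stay matched to their source rows even when the index is not a plain 0..n-1 range.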
jpp
  • If you want to stay within pandas: `pd.DataFrame(test.values.tolist()).stack().str.get_dummies().sum(level=0)` – BENY Sep 05 '18 at 16:05
  • @Wen, Yep, good one. Shameful to say but I've never conceptually appreciated `stack`, hence very rarely use it. Also seems to be the weak link in efficiency. – jpp Sep 05 '18 at 16:07
  • Then just do not use it :-) `pd.get_dummies(pd.DataFrame(test.values.tolist()),prefix_sep ='',prefix='').sum(level=0, axis = 1)` – BENY Sep 05 '18 at 16:10
  • I didn't know about this tool and I guess I didn't word my google searches well enough to find it. This is perfect. Thank you! – Brian Sep 05 '18 at 19:43
  • Just a heads up: Using `MultiLabelBinarizer` is several times faster than the `stack().str.get_dummies().sum()` approach. – Nicolai B. Thomsen Oct 14 '21 at 09:09
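
For anyone trying the pandas-only one-liners from the comments on a recent pandas release: the `level` argument to `sum` was deprecated and, as far as I know, is no longer available from pandas 2.0 onward. A rough equivalent, assuming pandas 0.25 or later for `Series.explode`, is to group on the index level instead. This is a sketch of one possible substitute, not the commenters' original code:

```python
import pandas as pd

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# explode() gives one row per list element and repeats the original index;
# str.get_dummies() one-hot-encodes each element;
# groupby(level=0).sum() collapses the rows back to one per original index.
res = test.explode().str.get_dummies().groupby(level=0).sum()
print(res)
```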