3

My data frame looks like this:

            a
0     [8, 10]
1  [12, 7, 9]

As you can see, column a contains lists. The numbers inside each list have meaning in our domain, and I want to use them as features. My expected output is:

   Tag_7  Tag_8  Tag_9  Tag_10  Tag_12
0      0      1      0       1       0
1      1      0      1       0       1

I tried some methods that I found on the internet. They produce the expected output, but their execution time is a problem. One of them is shown below:

pd.get_dummies(df.a.apply(pd.Series).stack().astype(int), prefix='Tag').groupby(level=0).sum()

I think this method is fine for small datasets, but it is too slow for my case. I need help. Thanks in advance. Have a nice day.

Logica
Fatih Taşdemir

2 Answers

2

Give scikit-learn a try to see if it helps:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Sorted distinct tags, used as column labels
cols = np.unique(np.concatenate(df.a))
df_final = pd.DataFrame(mlb.fit_transform(df.a), columns=cols).add_prefix('T_')

Out[213]:
   T_7  T_8  T_9  T_10  T_12
0    0    1    0     1     0
1    1    0    1     0     1
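
As a variation on the snippet above, the column labels can be read from the fitted binarizer's classes_ attribute instead of being recomputed with np.unique, so labels and columns stay in the same order by construction. This is a minimal sketch, assuming the sample frame from the question; the variable names and the Tag_ prefix are chosen for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample frame from the question
df = pd.DataFrame({"a": [[8, 10], [12, 7, 9]]})

mlb = MultiLabelBinarizer()
# 2-D array of 0/1 values, one column per distinct tag
encoded = mlb.fit_transform(df["a"])

# classes_ lists the distinct tags in the same order as the columns of `encoded`
df_final = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index).add_prefix("Tag_")
print(df_final)
```

Passing index=df.index keeps the result aligned with the original frame if its index is not the default 0..n-1.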

If you need to squeeze out every millisecond, chain.from_iterable is faster than np.concatenate, and np.char.add can prepend T_ to the column names:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from itertools import chain

mlb = MultiLabelBinarizer()
# Flatten the lists, take the sorted distinct tags, and prepend 'T_' to each label
cols = np.char.add('T_', np.unique(list(chain.from_iterable(df.a))).astype(str))
df_final = pd.DataFrame(mlb.fit_transform(df.a), columns=cols)
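
Whether the chain.from_iterable variant is actually faster depends on the data and library versions, so it may be worth measuring on your own frame. The sketch below is one way to do that with timeit; the synthetic data (10,000 rows of 1 to 5 tags each) and the helper names cols_concatenate and cols_chain are assumptions made for illustration, not part of the answer above:

```python
import timeit
from itertools import chain

import numpy as np
import pandas as pd

# Synthetic data (illustrative assumption): 10,000 rows with 1 to 5 tags each
rng = np.random.default_rng(0)
rows = [rng.integers(0, 50, size=rng.integers(1, 6)).tolist() for _ in range(10_000)]
s = pd.Series(rows)

def cols_concatenate():
    # Distinct tags via np.concatenate over the underlying object array of lists
    return np.unique(np.concatenate(s.to_numpy()))

def cols_chain():
    # Distinct tags via itertools.chain, flattening the lists in pure Python
    return np.unique(list(chain.from_iterable(s)))

# Time 20 repetitions of each helper and report total seconds
t_concat = timeit.timeit(cols_concatenate, number=20)
t_chain = timeit.timeit(cols_chain, number=20)
print(f"np.concatenate:      {t_concat:.4f} s")
print(f"chain.from_iterable: {t_chain:.4f} s")
```

Both helpers return the same array of distinct tags, so only the timings differ.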
Andy L.
0

This is a bit of a hack, but you can do it like this:

df['bitsum'] = df['input'].apply(lambda lst: sum(1 << x for x in lst))
pd.Series(np.array(list(map(lambda x: f'{x:b}', df['bitsum'])))).apply(lambda x: x[::-1]).str.split('')

I'm not sure whether it is faster, though. If you know the largest tag value n_max, you can replace 1 << x with 1 << (n_max - x), which lets you a) drop the string reversal apply(lambda x: x[::-1]) and b) use bin instead of lambda x: f'{x:b}', which may be faster too.
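
The same bit-encoding idea can also be decoded numerically, without going through binary strings. The sketch below is one possible variant, not the code above: it assumes the sample frame from the question (column a), that the tags within each list are distinct, and that every tag is a non-negative integer below 63 so the encoded value fits in int64:

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({"a": [[8, 10], [12, 7, 9]]})

# Encode each list as one integer with bit x set for every tag x.
# Assumption: tags are distinct within a list and smaller than 63.
bitsum = np.array([sum(1 << x for x in lst) for lst in df["a"]], dtype=np.int64)

# Sorted distinct tags that occur anywhere in the column
tags = np.unique([x for lst in df["a"] for x in lst])

# Shift every encoded value right by each tag position and keep the lowest bit;
# broadcasting yields a (rows x tags) matrix of 0/1 values
bits = (bitsum[:, None] >> tags[None, :]) & 1

df_final = pd.DataFrame(bits, columns=[f"Tag_{t}" for t in tags], index=df.index)
print(df_final)
```

The encoding step still loops in Python, so this is not guaranteed to beat MultiLabelBinarizer; it mainly avoids the string formatting, reversal, and splitting.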

Oleg O