How to get a one-hot encoding for a sentence?

Question

I have a list of sentences, and I want to one-hot encode each sentence as a whole, with one column per word.

For example:

sentences = [
  "python, java",
  "linux, windows, ubuntu",
  "java, linux, ubuntu, windows",
  "performance, python, mac"
]

I want output like this:

   java  linux  mac  performance  python  ubuntu  windows
0     1      0    0            0       1       0        0
1     0      1    0            0       0       1        1
2     1      1    0            0       0       1        1
3     0      0    1            1       1       0        0

My attempt:

I tried converting my sentences into a Series and then calling get_dummies, but I get one row per word instead of one row per sentence.

print(pd.get_dummies(pd.Series(sum([tag.split(', ') for tag in sentences], []))))

Output:

    java  linux  mac  performance  python  ubuntu  windows
0      0      0    0            0       1       0        0
1      1      0    0            0       0       0        0
2      0      1    0            0       0       0        0
3      0      0    0            0       0       0        1
4      0      0    0            0       0       1        0
5      1      0    0            0       0       0        0
6      0      1    0            0       0       0        0
7      0      0    0            0       0       1        0
8      0      0    0            0       0       0        1
9      0      0    0            1       0       0        0
10     0      0    0            0       1       0        0
11     0      0    1            0       0       0        0

How to solve this?

jezrael · Accepted Answer · 2018-11-29T10:15:47.113

Use MultiLabelBinarizer with a list comprehension that splits each sentence:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each sentence becomes a list of labels; classes_ holds the sorted column names.
df = pd.DataFrame(mlb.fit_transform([x.split(', ') for x in sentences]),
                  columns=mlb.classes_)
print(df)
   java  linux  mac  performance  python  ubuntu  windows
0     1      0    0            0       1       0        0
1     0      1    0            0       0       1        1
2     1      1    0            0       0       1        1
3     0      0    1            1       1       0        0
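For intuition, the encoding above amounts to two steps: collect the sorted set of all words, then mark for each sentence which of those words it contains. The following is a minimal plain-Python sketch of that idea, not scikit-learn's actual implementation; `one_hot_rows` is a hypothetical helper name introduced here for illustration.

```python
def one_hot_rows(sentences, sep=", "):
    """Return (sorted vocabulary, list of 0/1 rows), one row per sentence.

    Hypothetical helper for illustration; not part of pandas or scikit-learn.
    """
    # Split each sentence into a set of words, trimming stray whitespace.
    token_sets = [
        {tok.strip() for tok in s.split(sep) if tok.strip()}
        for s in sentences
    ]
    # Columns are the sorted union of all words seen in any sentence.
    vocab = sorted(set().union(*token_sets))
    # Mark 1 where the sentence contains the column's word, else 0.
    rows = [[int(word in tokens) for word in vocab] for tokens in token_sets]
    return vocab, rows


sentences = [
    "python, java",
    "linux, windows, ubuntu",
    "java, linux, ubuntu, windows",
    "performance, python, mac",
]

vocab, rows = one_hot_rows(sentences)
print(vocab)
for row in rows:
    print(row)
```

Each row corresponds to one sentence and each position to one word of the vocabulary, which is the same layout as the DataFrame above.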

Another solution uses Series.str.get_dummies with the separator passed as an argument:

print(pd.Series(sentences).str.get_dummies(', '))
   java  linux  mac  performance  python  ubuntu  windows
0     1      0    0            0       1       0        0
1     0      1    0            0       0       1        1
2     1      1    0            0       0       1        1
3     0      0    1            1       1       0        0

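The two solutions differ in performance. With the list repeated to 4,000 sentences, MultiLabelBinarizer was about 13 times faster than Series.str.get_dummies in these timings: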

sentences = sentences * 1000

In [166]: %%timeit
     ...: mlb = MultiLabelBinarizer()
     ...: df = pd.DataFrame(mlb.fit_transform([x.split(', ') for x in sentences]),columns=mlb.classes_)
     ...: 
8.06 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [167]: %%timeit
     ...: pd.Series(sentences).str.get_dummies(', ')
     ...: 
105 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
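The attempt in the question loses track of which sentence each word came from, because flattening the lists gives every word its own row. One way to repair it, sketched here under the assumption that pandas 0.25 or newer is available (for `Series.explode`), is to keep the sentence index on each word and then collapse the rows that share an index. This is an additional variant, not one of the two solutions timed above.

```python
import pandas as pd

sentences = [
    "python, java",
    "linux, windows, ubuntu",
    "java, linux, ubuntu, windows",
    "performance, python, mac",
]

# Split into lists, then explode so each word keeps its sentence's index.
words = pd.Series(sentences).str.split(", ").explode()

# get_dummies yields one row per word; grouping by the original index and
# taking the max merges those rows back into one row per sentence.
df = pd.get_dummies(words).groupby(level=0).max().astype(int)
print(df)
```

The final `astype(int)` is there because newer pandas versions return boolean dummy columns by default.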

How fast you are !!! really amazing, to create this question i took 5 minutes, but you answered this in 1 minute. It's unbelievable!!!! — Mohamed Thasin ah, Nov 29 '18 at 10:12
@MohamedThasinah - You are fast, I create question 10-20 minutes. It is much harder like answering... — jezrael, Nov 29 '18 at 10:13
@MohamedThasinah, exactly my thought, the answer is ready in less than a minute... Amazing. — Deepak Saini, Nov 29 '18 at 10:17
