One hot encoding with multiple ids, values in a large dataframe

Question

I have a data frame of the following form:

id  alphabet
20  a
20  b
30  b
30  c

Neither the ids nor the alphabet values are unique.
I would like the result in the following format:

id  alphabet_a  alphabet_b  alphabet_c
20  1           1           0
30  0           1           1

So the rows have been combined by unique id, and the values (alphabets) have been one-hot encoded.
How can this be done on a large data frame?

If I one-hot encode the data frame given above, I get 4 rows with ids 20, 20, 30, 30 and the appropriate columns. How can I then merge (or join, or add) two or more rows based on id? – Akhil, Nov 17 '17 at 07:15
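
The two-step route described in this comment (one-hot encode first, then collapse rows that share an id) can be sketched in base R as follows. This is only an illustration under assumptions: `mm` and `merged` are names made up here, and `model.matrix` plus `rowsum` is one possible way to do it, not necessarily what the asker tried.

```r
# Toy data from the question
df <- data.frame(id = c(20, 20, 30, 30),
                 alphabet = factor(c("a", "b", "b", "c")))

# Step 1: one indicator column per level, one row per input row.
# The "- 1" drops the intercept so that every level gets its own column.
mm <- model.matrix(~ alphabet - 1, data = df)

# Step 2: add up the indicator rows that share an id.
# rowsum() returns a matrix with one row per distinct id.
merged <- rowsum(mm, group = df$id)

# Repeated (id, alphabet) pairs would give counts above 1;
# clamp to 0/1 if a strict indicator is wanted.
merged <- (merged > 0) * 1
```

The result is a matrix whose row names are the ids and whose columns are named after the factor levels with the variable name as a prefix.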

score 0 · Accepted Answer · answered Nov 17 '17 at 07:15

You can use dcast from the reshape2 package, like this:

library(reshape2)

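# Recreate the example data from the question
df <- read.table(text = "id  alphabet
             20  a
             20  b
             30  b
             30  c", header = TRUE)

# One row per id, one column per alphabet value; each cell counts occurrences
dcast(df, id ~ alphabet, value.var = "alphabet", fun.aggregate = length)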

  id a b c
1 20 1 1 0
2 30 0 1 1
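
If installing reshape2 is not an option, roughly the same result can be had in base R with `table()`. The sketch below is an assumption-laden alternative, not part of the original answer: the names `counts`, `m` and `onehot` are made up for illustration, ids are assumed to be numeric, and the `alphabet_` prefix is added only to match the column names asked for in the question. Counts are clamped to 0/1 so that a repeated (id, alphabet) pair still yields an indicator.

```r
# Toy data from the question
df <- data.frame(id = c(20, 20, 30, 30),
                 alphabet = c("a", "b", "b", "c"))

# Cross-tabulate id against alphabet; each cell is an occurrence count
counts <- table(df$id, df$alphabet)

# Turn counts into 0/1 indicators, keeping the row and column labels
m <- matrix(as.integer(counts > 0),
            nrow = nrow(counts),
            dimnames = dimnames(counts))

# Build the data frame with prefixed column names and an explicit id column
onehot <- as.data.frame(m)
names(onehot) <- paste0("alphabet_", colnames(m))
onehot <- cbind(id = as.numeric(rownames(m)), onehot)  # assumes numeric ids
rownames(onehot) <- NULL

print(onehot)
```

Note that `table()` builds a dense ids-by-values matrix, so memory use grows with the number of distinct ids times the number of distinct values.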

answered Nov 17 '17 at 07:15

Hardik Gupta


It does seem to work! Thanks, Hardik! – Akhil Nov 17 '17 at 07:20
Can you please accept and upvote the answer if it solved your query? – Hardik Gupta Nov 17 '17 at 07:20
Done! Thanks once again. – Akhil Nov 17 '17 at 07:24
