R: many-hot encoding of a string of variable length values

Question

I have a variable called sid_set in my data.table, toytable:

library(data.table)

toytable <- data.table(id = c(1, 2, 3, 4),
                       sid_set = c("a, b, c", 
                                   "c, b", 
                                   "a", 
                                   "d, b") 
                       )
> toytable
   id sid_set
1:  1 a, b, c
2:  2    c, b
3:  3       a
4:  4    d, b

So sid_set is a string of variable length, and each string holds a comma-separated set of distinct values. About 1,500 distinct values can occur in sid_set.

I am trying to get dummy variables for each of the distinct possible values, like so:

dummy variables:    a     b     c     d
row1:               1     1     1     0
row2:               0     1     1     0
row3:               1     0     0     0
row4:               0     1     0     1

Given the format of my data and what I am trying to achieve, can someone please share some leads with respect to packages that may be able to help?

I have tried or looked into Matrix's sparse.model.matrix() and caret's createDataPartition, and tried to think of a dplyr solution, without much progress.

I have also tried a combination of "Split column at delimiter in data frame" and "Convert a dataframe to presence absence matrix":

> foo <- data.frame(do.call('rbind', strsplit(as.character(toytable$sid_set),', ',fixed=TRUE)))
Warning message:
In rbind(c("a", "b", "c"), c("c", "b"), "a", c("d", "b")) :
  number of columns of result is not a multiple of vector length (arg 2)
> head(foo)
  X1 X2 X3
1  a  b  c
2  c  b  c
3  a  a  a
4  d  b  d
> df2 <- melt(foo, id.var = "X1")
Warning message:
attributes are not identical across measure variables; they will be dropped 
> with(df2, table(V1, value))
Error in table(V1, value) : object 'V1' not found

EDIT:

Thanks to the commenters below, I have come up with:

> toytable <- data.table(id = c(1, 2, 3, 4),
+                        sid_set = c("a, b, c", 
+                                    "c, b", 
+                                    "a", 
+                                    "d, b") 
+                        )
> toytable
   id sid_set
1:  1 a, b, c
2:  2    c, b
3:  3       a
4:  4    d, b
> foo <- data.frame(do.call('rbind', strsplit(as.character(toytable$sid_set),', ',fixed=TRUE)))
Warning message:
In rbind(c("a", "b", "c"), c("c", "b"), "a", c("d", "b")) :
  number of columns of result is not a multiple of vector length (arg 2)
> toytable2 <- cbind(toytable[, "id"], foo)
> toytable2
   id X1 X2 X3
1:  1  a  b  c
2:  2  c  b  c
3:  3  a  a  a
4:  4  d  b  d
> library(reshape2)
> df2 <- melt(toytable2, id.var = "id")
> df3 <- with(df2, table(id, value))
> df3
   value
id  a b c d
  1 1 1 1 0
  2 0 1 2 0
  3 3 0 0 0
  4 0 1 0 2
>

This is pretty good. But is there any way to avoid a value > 1 in df3?
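
The counts above 1 appear to come from rbind() recycling the shorter vectors to fill three columns, which is what the warning is about (row 2 becomes c, b, c). A minimal sketch of one way around this, assuming the values are always separated by ", " exactly: split into long format with one row per (id, value) pair, so nothing is padded, and then cross-tabulate. The names long and onehot are only illustrative.

```r
library(data.table)

toytable <- data.table(id = c(1, 2, 3, 4),
                       sid_set = c("a, b, c", "c, b", "a", "d, b"))

# One row per (id, value) pair; no padding or recycling takes place.
long <- toytable[, .(value = unlist(strsplit(sid_set, ", ", fixed = TRUE))),
                 by = id]

# Guard against a value being repeated inside a single string.
long <- unique(long)

# Cross-tabulate: each pair occurs at most once, so every cell is 0 or 1.
onehot <- table(long$id, long$value)
onehot
```

Because recycling only repeats values that already belong to the same row, clamping the existing table with (df3 > 0) + 0L should give the same presence/absence pattern, but the long-format route avoids the warning altogether.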

First separate the [comma delimited values into different columns](https://stackoverflow.com/questions/7069076/split-column-at-delimiter-in-data-frame) and then [create a presence-absence matrix](https://stackoverflow.com/questions/22566592/convert-a-dataframe-to-presence-absence-matrix). — Ronak Shah, Feb 27 '18 at 01:41
@RonakShah interesting solutions, however the first solution did not work as intended. i've added details in original post. — user2205916, Feb 27 '18 at 01:54
It would work if you used the right variable name in your `table` statement - there is no `V1` in the `df2` data. — thelatemail, Feb 27 '18 at 02:06
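
With about 1,500 distinct values, a dense table may become wide, so a sparse indicator matrix could be worth considering. The following is a sketch only, assuming the Matrix package is installed and the separator is always ", "; the names parts, levels_all and m are hypothetical. It builds the matrix directly from row and column indices with Matrix::sparseMatrix().

```r
library(Matrix)

sid_set <- c("a, b, c", "c, b", "a", "d, b")

# Split each string and drop repeats within a row, because sparseMatrix()
# sums the x values of duplicated (i, j) pairs.
parts <- lapply(strsplit(sid_set, ", ", fixed = TRUE), unique)

# Every distinct value becomes one column.
levels_all <- sort(unique(unlist(parts)))

# Row index: the position of the string each value came from.
i <- rep(seq_along(parts), lengths(parts))
# Column index: the position of the value among the distinct values.
j <- match(unlist(parts), levels_all)

m <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(length(parts), length(levels_all)),
                  dimnames = list(NULL, levels_all))
m
```

as.matrix(m) converts the result to an ordinary dense matrix if one is needed for a small example like this.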
