I have a variable called sid_set in my data.table, toytable:
library(data.table)
toytable <- data.table(id = c(1, 2, 3, 4),
sid_set = c("a, b, c",
"c, b",
"a",
"d, b")
)
> toytable
id sid_set
1: 1 a, b, c
2: 2 c, b
3: 3 a
4: 4 d, b
So sid_set is a character column: each entry is a comma-separated string of variable length that lists a set of distinct values. There are about 1,500 distinct possible values that may be observed in sid_set.
I am trying to get dummy variables for each of the distinct possible values, like so:
dummy variables: a b c d
row1:            1 1 1 0
row2:            0 1 1 0
row3:            1 0 0 0
row4:            0 1 0 1
Given the format of my data and what I am trying to achieve, can someone please share some leads with respect to packages that may be able to help?
I have tried or looked into Matrix's sparse.model.matrix() and caret's createDataPartition, and tried to think of some dplyr solution, without much progress.
I have also tried a combination of the answers to "Split column at delimiter in data frame" and "Convert a dataframe to presence absence matrix":
> foo <- data.frame(do.call('rbind', strsplit(as.character(toytable$sid_set),', ',fixed=TRUE)))
Warning message:
In rbind(c("a", "b", "c"), c("c", "b"), "a", c("d", "b")) :
number of columns of result is not a multiple of vector length (arg 2)
> head(foo)
X1 X2 X3
1 a b c
2 c b c
3 a a a
4 d b d
> df2 <- melt(foo, id.var = "X1")
Warning message:
attributes are not identical across measure variables; they will be dropped
> with(df2, table(V1, value))
Error in table(V1, value) : object 'V1' not found
EDIT:
Thanks to the commenters below, I have come up with:
> toytable <- data.table(id = c(1, 2, 3, 4),
+ sid_set = c("a, b, c",
+ "c, b",
+ "a",
+ "d, b")
+ )
> toytable
id sid_set
1: 1 a, b, c
2: 2 c, b
3: 3 a
4: 4 d, b
> foo <- data.frame(do.call('rbind', strsplit(as.character(toytable$sid_set),', ',fixed=TRUE)))
Warning message:
In rbind(c("a", "b", "c"), c("c", "b"), "a", c("d", "b")) :
number of columns of result is not a multiple of vector length (arg 2)
> toytable2 <- cbind(toytable[, "id"], foo)
> toytable2
id X1 X2 X3
1: 1 a b c
2: 2 c b c
3: 3 a a a
4: 4 d b d
> library(reshape2)
> df2 <- melt(toytable2, id.var = "id")
> df3 <- with(df2, table(id, value))
> df3
value
id a b c d
1 1 1 1 0
2 0 1 2 0
3 3 0 0 0
4 0 1 0 2
>
This is pretty good. But is there any way to avoid a value > 1 in df3?
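As far as I can tell, the counts above 1 come from rbind() recycling the shorter vectors (that is what the warning is about), so row 2 gets "c" twice and row 3 gets "a" three times. One direction that might avoid this is to skip rbind() and build a long table directly from the strsplit() result. This is only a minimal sketch on the toy data, not something tried on the full data with about 1,500 values; the names parts, long and dummies are made up for illustration.

```r
library(data.table)

toytable <- data.table(id = c(1, 2, 3, 4),
                       sid_set = c("a, b, c", "c, b", "a", "d, b"))

# Split each string into its own character vector.
# Nothing is bound into a rectangle, so nothing gets recycled.
parts <- strsplit(toytable$sid_set, ", ", fixed = TRUE)

# Long format: one row per (id, value) pair.
long <- data.table(id = rep(toytable$id, lengths(parts)),
                   value = unlist(parts))

# unique() drops repeated (id, value) pairs, so every cell of the
# cross-tabulation is either 0 or 1.
dummies <- with(unique(long), table(id, value))
dummies
```

On the toy data this should give one row per id and one column per distinct value, with only 0s and 1s. With 1,500 distinct values the dense table could get large, so a sparse representation may be worth considering instead.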