One can't say what is best without knowing the purpose, but storing the answers as indicator columns, i.e. one 0/1 column per option, would let you run regressions or tabulate them easily. Here we convert x into a 0/1 matrix m and then compute the fraction of respondents who answered yes to each question. We also regress with the indicators in various ways, of which two are shown, and take various correlations and plots. Finally, we show a plot based on applying stack to the list representation, so it might be useful to keep more than one representation and convert among them.
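x <- c("1,2,3", "1,4,5")
# 0/1 indicator matrix: one row per respondent, one column per option 1..5
m <- t(+outer(1:5, lapply(strsplit(x, ","), as.numeric), Vectorize(`%in%`)))
colMeans(m)  # fraction of respondents selecting each option
y <- 1:2  # illustrative response, one value per respondent
lm(y ~ m+0)  # regress y on the indicators, no intercept
# one logistic regression per option, with the indicator as the response
lapply(1:5, function(i) glm(m[, i] ~ y, family = binomial()))
cor(m)  # correlations between options (columns)
cor(t(m))  # correlations between respondents (rows)
heatmap(m)
# long format: one row per (selected option, respondent) pair
stk <- stack(setNames(lapply(strsplit(x, ","), as.numeric), seq_along(x)))
plot(stk)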
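The one-line construction of m above is compact but dense. As a rough sketch of the same idea in separate steps (the names sel, opts and m2 are made up for this illustration, and it assumes the options are exactly 1 to 5):

```r
x <- c("1,2,3", "1,4,5")

# Split each response into the option numbers it contains.
sel <- lapply(strsplit(x, ","), as.numeric)

# `opts` is an assumed name for the full set of possible options.
opts <- 1:5

# For each respondent, mark which options were selected (1) or not (0).
# sapply() returns options in rows, so transpose to get respondents in rows.
m2 <- t(sapply(sel, function(s) as.integer(opts %in% s)))
colnames(m2) <- opts

m2
```

This should hold the same 0/1 values as m, with the added convenience of column names that identify the options.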
Here is a data frame holding four different representations: the original string, a list column, the 0/1 indicator columns and a single numeric code:
library(dst) # encode/decode
DF <- data.frame(x, stringsAsFactors = FALSE)
DF$list <- strsplit(x, ",")
DF <- cbind(DF, m, code = apply(m, 1, decode, base = 2))
DF
##       x    list 1 2 3 4 5 code
## 1 1,2,3 1, 2, 3 1 1 1 0 0   28
## 2 1,4,5 1, 4, 5 1 0 0 1 1   19
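Since converting among representations is the point, here is a minimal base R sketch of going back from the indicator matrix to the list and string forms. The matrix is typed in by hand so the snippet runs on its own, and the names sel and x2 are only illustrative:

```r
# Indicator matrix with the same values as m
# (rows = respondents, columns = options 1..5).
m <- rbind(c(1, 1, 1, 0, 0),
           c(1, 0, 0, 1, 1))

# Indicator matrix -> list of selected option numbers per respondent.
sel <- lapply(seq_len(nrow(m)), function(i) which(m[i, ] == 1))

# List -> comma-separated strings, as in the original x.
x2 <- sapply(sel, paste, collapse = ",")
x2
## [1] "1,2,3" "1,4,5"
```

This relies on the column positions being the option numbers; with other option labels one would index into the column names instead.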
Note that decode converts each row of 0/1 values into a single number by reading it as a base 2 number, and encode can be used to reverse that:
t(encode(base = rep(2, 5), c(28, 19)))
##   [,1] [,2] [,3] [,4] [,5]
## r    1    1    1    0    0
##      1    0    0    1    1
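If you would rather not depend on a package for this, the same arithmetic can be sketched in base R. This is not how dst implements it, just the underlying base 2 calculation, assuming the most significant digit comes first (which is what 28 and 19 above imply). The names w, code and back are made up here:

```r
# Indicator matrix with the same values as m.
m <- rbind(c(1, 1, 1, 0, 0),
           c(1, 0, 0, 1, 1))

# Place values for 5 binary digits, most significant first: 16 8 4 2 1.
w <- 2^((ncol(m) - 1):0)

# 0/1 rows -> numeric codes: sum of the place values of the selected options.
code <- as.vector(m %*% w)
code
## [1] 28 19

# Codes -> 0/1 rows: integer-divide by each place value and keep the parity.
back <- t(sapply(code, function(k) (k %/% w) %% 2))
all(back == m)
## [1] TRUE
```

A single code per respondent is compact and easy to tabulate, but it is only readable once decoded, so it works best alongside one of the other representations.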