I'm working with multiple-response questions in a survey, and I have a character column that contains values that look like "1,2,3" and "1,4,5". The participants click all values that apply, and I'm given this result.

What is the best solution to deal with this problem? Should I create new columns that tell me if a value in that list is present or not? Or can I create a column that has a list/vector class?

blaze
  • What is your desired output? Yes, in general you should avoid working with CSV data if you can. – Tim Biegeleisen Sep 11 '19 at 00:47
  • Technically, yes, `list`-columns can work in frames (try `mtcars$new <- Map(c, mtcars$gear, mtcars$carb)`), but some frame-friendly tools don't always react well to them (though there are always workarounds). A different approach might be to store the different values in a "long" format instead of storing multiple values in a single "cell". This takes restructuring of the rest of the data, so is not simple enough for a comment (and needs a more-reproducible problem). – r2evans Sep 11 '19 at 00:54
  • Convert those values into separate rows. https://stackoverflow.com/questions/15347282/split-delimited-strings-in-a-column-and-insert-as-new-rows – Ronak Shah Sep 11 '19 at 01:08
  • Really it depends on your goals - what do you want to *do* with the data? But yes, in most cases I agree with Ronak, separate rows are easiest. – Gregor Thomas Sep 11 '19 at 01:26
  • If you only have 5 options, then it sometimes makes sense to create 5 columns that are just 0/1 to signify whether or not that response was ticked, rather than going to separate rows. So you could have columns like `Q1_a`, `Q1_b`, etc., even better if you replace `a` and `b` with the actual name of what they selected. – Marius Sep 11 '19 at 01:29
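
The two suggestions in the comments above, separate rows and one 0/1 column per option, might be sketched in base R as follows. This is a minimal illustration rather than the commenters' own code: the vector `x`, the respondent `id`, the `Q1_` column prefix, and the premise that the options are numbered 1 to 5 are all assumptions made for the example.

```r
# Hypothetical responses, one string per respondent
x <- c("1,2,3", "1,4,5")

# Split each string into its selected options
parts <- lapply(strsplit(x, ","), as.integer)

# Long format: one row per (respondent, selected option) pair
long <- data.frame(
  id     = rep(seq_along(parts), lengths(parts)),
  option = unlist(parts)
)

# Wide format: cross-tabulate respondent by option to get 0/1 indicators
# (options assumed to be 1..5; each option appears at most once per respondent)
wide <- unclass(table(id = long$id, option = factor(long$option, levels = 1:5)))
colnames(wide) <- paste0("Q1_", colnames(wide))
```

The long form suits grouping and counting (for example `table(long$option)`), while the wide form can be passed directly to modelling functions.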

1 Answer

One can't say what is best without knowing the purpose, but storing the responses as indicator columns, i.e. one 0/1 column per option, would let you tabulate them or use them in a regression easily. Here we convert x into a 0/1 matrix m and compute the fraction of respondents who selected each option. We then regress with the indicators in two of the many possible ways, compute correlations between options and between respondents, and draw a heatmap.

We also show a plot based on applying stack to the list representation, so it may be useful to keep more than one representation and convert among them.

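x <- c("1,2,3", "1,4,5")

# 0/1 indicator matrix: one row per respondent, one column per option 1..5
m <- t(+outer(1:5, lapply(strsplit(x, ","), as.numeric), Vectorize(`%in%`)))

# fraction of respondents who selected each option
colMeans(m)

# y stands in for an outcome variable
y <- 1:2
lm(y ~ m+0)
lapply(1:5, function(i) glm(m[, i] ~ y, family = binomial()))

cor(m)     # correlations between options
cor(t(m))  # correlations between respondents

heatmap(m)

# long form: one row per selected option, respondent number in column ind
stk <- stack(setNames(lapply(strsplit(x, ","), as.numeric), seq_along(x)))
plot(stk)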
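
As for converting among representations, here is a minimal sketch of going from the 0/1 matrix back to the list and string forms. It assumes the options are numbered 1 to 5 and builds `m` in a more explicit way than the `outer` call; the names `parts`, `back_list` and `back_string` are illustrative only.

```r
x <- c("1,2,3", "1,4,5")
parts <- lapply(strsplit(x, ","), as.numeric)

# Explicit construction of the indicator matrix (options assumed to be 1..5)
m <- t(sapply(parts, function(p) +(1:5 %in% p)))

# Indicator matrix -> list of selected options per respondent
back_list <- lapply(seq_len(nrow(m)), function(i) which(m[i, ] == 1))

# List -> comma-separated strings, as in the original column
back_string <- vapply(back_list, paste, character(1), collapse = ",")
```

The round trip works here because the column position doubles as the option number; with other option labels one would index into a vector of labels instead.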

Here is a data frame with 4 different representations: the original string, a list column, the 0/1 indicator columns and a single numeric code:

library(dst) # encode/decode

DF <- data.frame(x, stringsAsFactors = FALSE)
DF$list <- strsplit(x, ",")
DF <- cbind(DF, m, code = apply(m, 1, decode, base = 2))
DF
##       x     list  1 2 3 4 5  code
## 1 1,2,3  1, 2, 3  1 1 1 0 0    28
## 2 1,4,5  1, 4, 5  1 0 0 1 1    19

Note that decode converts each row of 0/1 values into a single number by reading the row as binary digits, and encode can be used to reverse that:

t(encode(base = rep(2, 5), c(28, 19)))
##   [,1] [,2] [,3] [,4] [,5]
## r    1    1    1    0    0
##      1    0    0    1    1
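
To make the coding explicit, here is a base R sketch of the same arithmetic without the dst package. It treats each row as a five-digit binary number, most significant digit first, and arrives at the codes 28 and 19. The names `weights`, `code` and `bits` are illustrative, and this is not a claim about how dst implements encode or decode.

```r
# Indicator matrix from above: rows are respondents, columns are options 1..5
m <- rbind(c(1, 1, 1, 0, 0),
           c(1, 0, 0, 1, 1))

# Place values for five binary digits, most significant first: 16 8 4 2 1
weights <- 2^(4:0)

# 0/1 row -> single number
code <- as.vector(m %*% weights)

# Single number -> 0/1 row (integer division by each place value, then parity)
bits <- t(sapply(code, function(k) (k %/% weights) %% 2))
```

A single code is compact for storage or for counting distinct response patterns, but the indicator columns remain easier to model with.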
G. Grothendieck