How do I, in R, transform a dataset which has a response per row into a matrix where columns are answers?

Question

Let's start with:

df <- structure(list(user.id = c(2L, 3L, 1L, 3L, 1L, 4L), questions = structure(c(3L, 
3L, 3L, 1L, 3L, 2L), .Label = c("Do you own an xbox?", "How many game consoles do you own?", 
"which game did you buy recently?"), class = "factor"), answers = structure(c(2L, 
5L, 3L, 6L, 4L, 1L), .Label = c("3", "DOOM", "Fallout 3", "Ghost Recon", 
"Mario", "yes"), class = "factor")), .Names = c("user.id", "questions", 
"answers"), row.names = c(NA, -6L), class = "data.frame")

This gives us the data.frame

> df
  user.id                          questions     answers
1       2   which game did you buy recently?        DOOM
2       3   which game did you buy recently?       Mario
3       1   which game did you buy recently?   Fallout 3
4       3                Do you own an xbox?         yes
5       1   which game did you buy recently? Ghost Recon
6       4 How many game consoles do you own?           3

I'd like to transform this to a data.frame or equivalent where:

> matrixed
  user.id q_1 q_2         q_3
1       1         Ghost Recon
2       2                DOOM
3       3 yes           Mario
4       4       3

Right now I'm using this primitive piece of code:

questions <- sort(unique(df$questions))
user.id <- unique(sort(df$user.id))
matrixed <- data.frame(user.id)
sapply(1:length(questions), function(i) matrixed[, paste0("q_", i)] <<- rep("", length(user.id)))
sapply(1:nrow(df), function(j) matrixed[df[j, ]$user.id, paste0("q_", which(df[j, ]$questions == questions))] <<- as.character(df[j, ]$answers))

Are there more elegant ways to do this -- perhaps libraries that help handle this type of data?

You need to figure out a way to deal with duplicates, e.g. `library(tidyverse); df %>% group_by(user.id, questions = paste0('q_', as.integer(questions))) %>% summarise(answers = toString(answers)) %>% spread(questions, answers)` — alistaire, May 08 '17 at 06:56

RoyalTS · Accepted Answer · 2017-05-08T07:03:34.653

1

The standard tidyverse solution to this:

library(dplyr)
library(tidyr)

df %>% tidyr::spread(questions, answers)

Doesn't quite work in your case because user 1 answered which game did you buy recently? twice according to your data.

If you wanted to take the first answer for each user.id x question combination, you could use

df %>% 
  distinct(user.id, questions, .keep_all = TRUE) %>%
  tidyr::spread(questions, answers)

Careful: The result of this will depend on how your data is sorted.

edited May 08 '17 at 07:03

answered May 08 '17 at 06:55

RoyalTS

9,545
12
60
101

@RoyalTS, many thanks for the answer -- that was part of the problem -- so the idea is I could choose to use the latest answer to over-ride the first answer or perhaps use the first answer -- or some other treatment. I'm not familiar with tidyr -- I guess that's where I need to look – user1172468 May 08 '17 at 07:00
1

@user1172468 See my edit. – RoyalTS May 08 '17 at 07:07

How do I, in R, transform a dataset which has a response per row into a matrix where columns are answers?

1 Answers1