0

Let's start with:

df <- structure(list(user.id = c(2L, 3L, 1L, 3L, 1L, 4L), questions = structure(c(3L, 
3L, 3L, 1L, 3L, 2L), .Label = c("Do you own an xbox?", "How many game consoles do you own?", 
"which game did you buy recently?"), class = "factor"), answers = structure(c(2L, 
5L, 3L, 6L, 4L, 1L), .Label = c("3", "DOOM", "Fallout 3", "Ghost Recon", 
"Mario", "yes"), class = "factor")), .Names = c("user.id", "questions", 
"answers"), row.names = c(NA, -6L), class = "data.frame")

This gives us the data.frame

> df
  user.id                          questions     answers
1       2   which game did you buy recently?        DOOM
2       3   which game did you buy recently?       Mario
3       1   which game did you buy recently?   Fallout 3
4       3                Do you own an xbox?         yes
5       1   which game did you buy recently? Ghost Recon
6       4 How many game consoles do you own?           3

I'd like to transform this to a data.frame or equivalent where:

> matrixed
  user.id q_1 q_2         q_3
1       1         Ghost Recon
2       2                DOOM
3       3 yes           Mario
4       4       3   

Right now I'm using this primitive piece of code:

questions <- sort(unique(df$questions))
user.id <- unique(sort(df$user.id))
matrixed <- data.frame(user.id)
sapply(1:length(questions), function(i) matrixed[, paste0("q_", i)] <<- rep("", length(user.id)))
sapply(1:nrow(df), function(j) matrixed[df[j, ]$user.id, paste0("q_", which(df[j, ]$questions == questions))] <<- as.character(df[j, ]$answers))

Are there more elegant ways to do this -- perhaps libraries that help handle this type of data?

alistaire
  • 42,459
  • 4
  • 77
  • 117
user1172468
  • 5,306
  • 6
  • 35
  • 62
  • 3
    You need to figure out a way to deal with duplicates, e.g. `library(tidyverse); df %>% group_by(user.id, questions = paste0('q_', as.integer(questions))) %>% summarise(answers = toString(answers)) %>% spread(questions, answers)` – alistaire May 08 '17 at 06:56

1 Answers1

1

The standard tidyverse solution to this:

library(dplyr)
library(tidyr)

df %>% tidyr::spread(questions, answers)

Doesn't quite work in your case because user 1 answered which game did you buy recently? twice according to your data.

If you wanted to take the first answer for each user.id x question combination, you could use

df %>% 
  distinct(user.id, questions, .keep_all = TRUE) %>%
  tidyr::spread(questions, answers)

Careful: The result of this will depend on how your data is sorted.

RoyalTS
  • 9,545
  • 12
  • 60
  • 101
  • @RoyalTS, many thanks for the answer -- that was part of the problem -- so the idea is I could choose to use the latest answer to over-ride the first answer or perhaps use the first answer -- or some other treatment. I'm not familiar with tidyr -- I guess that's where I need to look – user1172468 May 08 '17 at 07:00
  • 1
    @user1172468 See my edit. – RoyalTS May 08 '17 at 07:07