2

I am working on a dataset (namely w3) in R that looks like this:

Q1   Q2   Q3   Q4 WorksheetID UserID
395 2178 2699 1494           3     65
395 2178 2694 1394           3     78
395 1178 2699 1394           3     79
395  278 2699 1394           3     80
295 1188 2799 1494           3     81
395 2278 2699 2394           3     81
395 2178 2699 1394           3     83
495 1178 2709 1394           3     84
395 2198 2799 1294           3     85
395 2178 2699 1394           3     85
395 1178 2699 1394           3     86

Each User has answered 4 questions(Q1,Q2,Q3,Q4) of the worksheet.

What I want to do is to group the users according to the answers the answered in each Question.

e.g. for Q1

Q1   freq UserID
295  1    81
395  9    65 78 79 80 81 83 85 85 86
495  1    84

What I have done is

w3Q1<-count(W3,"Q1")
for(i in w3Q1$Q1)
  {qry<-paste('select userID from w3 where Q1=',i)
  print(i)
  print(sqldf(qry))}

My question is, how do I combine the above results and turn it into the table with the variables Q1, freq, UserID (as shown above)? Or is there any simpler way?

Thanks a lot

tiffkyn
  • 79
  • 1
  • 7

2 Answers2

3

We can first tally the frequencies with table and coerce it to data.frame. Then we can group and paste the UserID with aggregate and merge the two data frames:

tbl <- as.data.frame(table(df$Q1))
ids <- aggregate(UserID~Q1, df, paste, collapse=' ')
merge(tbl, ids, by.x='Var1', by.y='Q1')
#   Var1 Freq                     UserID
# 1  295    1                         81
# 2  395    9 65 78 79 80 81 83 85 85 86
# 3  495    1                         84
Pierre L
  • 28,203
  • 6
  • 47
  • 69
2

We can also group by 'Q1' and summarise using dplyr.

library(dplyr)
df1 %>%
   group_by(Q1) %>%
   summarise(freq= n(), UserID= toString(UserID))
#     Q1  freq                             UserID
#   (int) (int)                              (chr)
#1   295     1                                 81
#2   395     9 65, 78, 79, 80, 81, 83, 85, 85, 86
#3   495     1                                 84

For questions Q1:Q4, it may be better to reshape to long format with gather from tidyr.

library(tidyr)
gather(df1, Var, Qs, Q1:Q4) %>% 
        group_by(Var, Qs) %>%
        summarise(freq=n(), UserID = toString(UserID))

As @Steven Beaupré mentioned, if we need a list we can replace the toString with list(UserID)


A similar approach using data.table would be to convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'Q1', we get the nrow (.N) as 'freq' and paste together the 'UserID'

library(data.table)
setDT(df1)[, list(freq=.N, UserID=toString(UserID)), by = Q1]

For the multiple Q columns, we melt to 'long' format and then summarise by group.

melt(setDT(df1), measure=1:4)[, list(freq=.N, 
      UserID= toString(UserID)) ,.(variable, value)]
akrun
  • 874,273
  • 37
  • 540
  • 662