
I have been given the results of several surveys as data frames, with the questions as columns and the answers as rows. That is, each data frame has the following form:

  Q1 Q2 Q3 ... Qn
1  5  4  5 ...  2
2  5  5 NA ...  3
3  2  4  1 ...  2
4  3  3  3 ...  3
5  5  3  5 ...  1
...

The number of questions (= columns) varies from survey to survey (and is usually quite large), but the answers given are always integers in the range from 1 to 5, or NA if no answer was given.

I need to transform this data into a (long form) data frame representing the frequency count of the given answers for each question, that is:

question value freq
      Q1     1   12
      Q1     2   41
      Q1     3  123
      Q1     4  231
      Q1     5  401
      Q2     1   11
      Q2     2   32
      Q2     3  122
      Q2     4  321
      Q2     5  173
...

However, I was unable to come up with a solution that achieves this. I understand that data.frame(table(survey$Q1)) more or less produces the frequency count I am looking for, but only for a single question (see the snippet below the sample data). Combining such data frames "by hand" for each of the large number of questions is infeasible. If possible, I am also looking for a fairly general solution that can handle the varying number of questions in my different surveys.

Thanks in advance; any help is appreciated.

Code snippet for generating sample data:

Q1 <- c(5, 5, 2, 3, 5, 4, 3, 5, 2, 3)
Q2 <- c(4, 5, 4, 3, 3, 5, 3, 5, 4, 3)
Q3 <- c(5, NA, 1, 3, 5, 5, 2, 3, 5, 5)
Qn <- c(2, 3, 2, 3, 1, NA, 3, 2, 3, 1)
survey <- data.frame(Q1, Q2, Q3, Qn)
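
For reference, this is the single-question count I mentioned above; building one of these by hand for every column is exactly what I want to avoid:

data.frame(table(survey$Q1))
#   Var1 Freq
# 1    2    2
# 2    3    3
# 3    4    1
# 4    5    4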
– MRA (edited by Jaap)

3 Answers


We could convert 'survey' to a data.table with setDT, reshape it from 'wide' to 'long' format with melt, and then get the number of rows (.N) grouped by 'variable' and 'value':

library(reshape2)
library(data.table)
melt(setDT(survey), na.rm=TRUE)[, .N, by = list(variable,value)]
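
If you also want the question/value/freq column names and the row order from the question, one possible follow-up (an extra step, not part of the answer above, using data.table's setnames and setorder) is:

res <- melt(setDT(survey), na.rm = TRUE)[, .N, by = list(variable, value)]
setnames(res, c("question", "value", "freq"))  # rename the melt/.N columns to match the question
setorder(res, question, value)                 # sort by question, then by answer value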

Or a base R method would be

subset(as.data.frame(table(stack(survey))), Freq!=0)
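
The same renaming can be bolted onto the base R version; a sketch (again my addition, not part of the original answer):

res <- subset(as.data.frame(table(stack(survey))), Freq != 0)
names(res) <- c("value", "question", "freq")      # stack() names its columns 'values' and 'ind'
res$value <- as.integer(as.character(res$value))  # table() returns the answer values as a factor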
– akrun (edited by A5C1D2H2I1M1N2O1R2T1)
  • I like this base solution in particular as the user can omit the subset() if they want the fully-populated result set with zeroes. – Sam Firke Mar 25 '15 at 19:32
  • Both solutions seem to work fine for me, many thanks. I am always amazed by the conciseness of most R solutions. I just wish I could come up with such solutions myself. :/ – MRA Mar 25 '15 at 19:33

You can use the tidyr package to create a long data set, then the dplyr package to create the counts of subgroups:

library(tidyr)
library(dplyr)
long <- gather(survey, question, value, na.rm = TRUE)

long %>%
  group_by(question, value) %>%
  tally() %>% # populate the counts
  arrange(question, value) # sort the results
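
In more recent versions of tidyr (1.0 and later), gather() has been superseded by pivot_longer(); a rough equivalent of the pipeline above (an updated sketch assuming a reasonably current tidyr and dplyr, not what was available when this was written) would be:

library(tidyr)
library(dplyr)

survey %>%
  pivot_longer(everything(), names_to = "question", values_to = "value",
               values_drop_na = TRUE) %>%   # long format, dropping NA answers
  count(question, value, name = "freq") %>% # count() replaces group_by() + tally()
  arrange(question, value)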
– Sam Firke
  • Seems to be a reasonable solution, but I am unable to install tidyr. Trying install.packages("tidyr") just leads to a "Warning message: package ‘tidyr’ is not available" for me. Am I doing something wrong with the installer, or do I have to "sideload" the package somehow? – MRA Mar 25 '15 at 19:37
  • Strange - when I run that exact command it installs fine for me. It's on CRAN so should install that easily. Some troubleshooting ideas: http://stackoverflow.com/questions/25721884/how-should-i-deal-with-package-xxx-is-not-available-warning – Sam Firke Mar 25 '15 at 19:42

tabulate is a good candidate here. Combine it with lapply and stack, and use cbind to add an indicator of 1 to 5 representing the answer value you are looking for. This is a slight variation on @akrun's table + stack approach that should be a little more efficient if you have a lot of rows and a lot of columns:

cbind(Val = 1:5, stack(lapply(survey, tabulate, nbins = 5)))
#    Val values ind
# 1    1      0  Q1
# 2    2      2  Q1
# 3    3      3  Q1
# 4    4      1  Q1
# 5    5      4  Q1
# 6    1      0  Q2
# 7    2      0  Q2
# 8    3      4  Q2
# 9    4      3  Q2
# 10   5      3  Q2
# 11   1      1  Q3
# 12   2      1  Q3
# 13   3      2  Q3
# 14   4      0  Q3
# 15   5      5  Q3
# 16   1      2  Qn
# 17   2      3  Qn
# 18   3      4  Qn
# 19   4      0  Qn
# 20   5      0  Qn
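
Note that tabulate() silently ignores NAs (and anything outside 1..nbins), which is why the missing answers need no special handling here. A quick check of this on the Q3 column of the sample data (my own illustration, not part of the output above):

tabulate(survey$Q3, nbins = 5)
# [1] 1 1 2 0 5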

You can rename the columns "on the fly" with setNames:

cbind(Value = 1:5, 
      setNames(stack(lapply(survey, tabulate, nbins = 5)), 
               c("Freq", "Question")))

Here's how base compares:

f1 <- function() cbind(Val = 1:5, stack(lapply(survey, tabulate, nbins = 5)))
f2 <- function() as.data.frame(table(stack(survey)))
f3 <- function() melt(as.data.table(survey), na.rm=TRUE)[, .N, by = list(variable,value)]
f4 <- function() {
  long <- gather(survey, question, value, na.rm = TRUE)

  long %>%
    group_by(question, value) %>%
    tally() %>% # populate the counts
    arrange(question, value) # sort the results
}

library(tidyr)
library(dplyr)
library(data.table)
library(microbenchmark)

## A bigger dataset
set.seed(1)
survey <- data.frame(do.call(cbind, replicate(100, list(sample(c(1:5, NA), 10000, TRUE)))))

system.time(f2())
#    user  system elapsed 
#   0.801   0.000   0.802 
system.time(f4())
#    user  system elapsed 
#   0.261   0.000   0.268 

microbenchmark(f1(), f3()) # The warnings are from `melt`
# Unit: milliseconds
#  expr      min        lq      mean   median         uq       max neval
#  f1()  2.96567  3.772974  5.157065  4.16065   4.793876  13.51471   100
#  f3() 23.79500 77.893235 91.029498 87.22616 102.263556 147.69982   100
# There were 50 or more warnings (use warnings() to see the first 50)
– A5C1D2H2I1M1N2O1R2T1