6

I had a problem today figuring out a way to do an aggregation in dplyr in R but for some reason was unable to come up with a solution (although I think this should be quite easy).

I have a data set like this:

structure(list(date = structure(c(16431, 16431, 16431, 16432, 
16432, 16432, 16433, 16433, 16433), class = "Date"), colour = structure(c(3L, 
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L), .Label = c("blue", "green", 
"red"), class = "factor"), shape = structure(c(2L, 2L, 3L, 3L, 
3L, 2L, 1L, 1L, 1L), .Label = c("circle", "square", "triangle"
), class = "factor"), value = c(100, 130, 100, 180, 125, 190, 
120, 100, 140)), .Names = c("date", "colour", "shape", "value"
), row.names = c(NA, -9L), class = "data.frame")

which shows like this:

        date colour    shape value
1 2014-12-27    red   square   100
2 2014-12-27   blue   square   130
3 2014-12-27   blue triangle   100
4 2014-12-28  green triangle   180
5 2014-12-28  green triangle   125
6 2014-12-28    red   square   190
7 2014-12-29    red   circle   120
8 2014-12-29   blue   circle   100
9 2014-12-29   blue   circle   140

My goal is to calculate the most frequent colour, shape and the mean value per day. My expected output is the following:

        date colour    shape value
1 27/12/2014   blue   square   110
2 28/12/2014  green triangle   165
3 29/12/2014   blue   circle   120

I ended up doing it using split and writing my own function to calculate the above for a data.frame, then used snow::clusterApply to run it in parallel. It was efficient enough (my original dataset is about 10M rows long) but I am wondering whether this can happen in one chain using dplyr. Efficiency is really important for this so being able to run it in one chain is quite important.

LyzandeR
  • 37,047
  • 12
  • 77
  • 87
  • Have you tried using `group_by` and `summarize`? [This approach](http://stackoverflow.com/questions/18433647/how-to-get-the-most-frequent-level-of-a-categorical-variable-in-r) could get the most common level. – David Robinson Apr 23 '15 at 23:37

1 Answers1

10

You could do

dat %>% group_by(date) %>%
    summarize(colour = names(which.max(table(colour))),
              shape = names(which.max(table(shape))),
              value = mean(value))
David Robinson
  • 77,383
  • 16
  • 167
  • 187
  • I knew I took the long road using `split` + `clusterApply`! Thanks a lot! This is what I was looking for and I got confused with how to combine the functions needed with `summarize`. Thanks again!! – LyzandeR Apr 23 '15 at 23:45
  • 1
    @LyzandeR very happy to help! – David Robinson Apr 23 '15 at 23:46