Multiple aggregations (categorical and numeric) with dplyr in one chain

Question

I had a problem today figuring out a way to do an aggregation in dplyr in R but for some reason was unable to come up with a solution (although I think this should be quite easy).

I have a data set like this:

structure(list(date = structure(c(16431, 16431, 16431, 16432, 
16432, 16432, 16433, 16433, 16433), class = "Date"), colour = structure(c(3L, 
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L), .Label = c("blue", "green", 
"red"), class = "factor"), shape = structure(c(2L, 2L, 3L, 3L, 
3L, 2L, 1L, 1L, 1L), .Label = c("circle", "square", "triangle"
), class = "factor"), value = c(100, 130, 100, 180, 125, 190, 
120, 100, 140)), .Names = c("date", "colour", "shape", "value"
), row.names = c(NA, -9L), class = "data.frame")

which shows like this:

        date colour    shape value
1 2014-12-27    red   square   100
2 2014-12-27   blue   square   130
3 2014-12-27   blue triangle   100
4 2014-12-28  green triangle   180
5 2014-12-28  green triangle   125
6 2014-12-28    red   square   190
7 2014-12-29    red   circle   120
8 2014-12-29   blue   circle   100
9 2014-12-29   blue   circle   140

My goal is to calculate the most frequent colour, shape and the mean value per day. My expected output is the following:

        date colour    shape value
1 27/12/2014   blue   square   110
2 28/12/2014  green triangle   165
3 29/12/2014   blue   circle   120

I ended up doing it using split and writing my own function to calculate the above for a data.frame, then used snow::clusterApply to run it in parallel. It was efficient enough (my original dataset is about 10M rows long) but I am wondering whether this can happen in one chain using dplyr. Efficiency is really important for this so being able to run it in one chain is quite important.

Have you tried using `group_by` and `summarize`? [This approach](http://stackoverflow.com/questions/18433647/how-to-get-the-most-frequent-level-of-a-categorical-variable-in-r) could get the most common level. — David Robinson, Apr 23 '15 at 23:37

score 10 · Accepted Answer · answered Apr 23 '15 at 23:42

10

You could do

dat %>% group_by(date) %>%
    summarize(colour = names(which.max(table(colour))),
              shape = names(which.max(table(shape))),
              value = mean(value))

answered Apr 23 '15 at 23:42

David Robinson

77,383
16
167
187

I knew I took the long road using `split` + `clusterApply`! Thanks a lot! This is what I was looking for and I got confused with how to combine the functions needed with `summarize`. Thanks again!! – LyzandeR Apr 23 '15 at 23:45
1

@LyzandeR very happy to help! – David Robinson Apr 23 '15 at 23:46

Multiple aggregations (categorical and numeric) with dplyr in one chain

1 Answers1