58

I want to start using dplyr in place of ddply but I can't get a handle on how it works (I've read the documentation).

For example, why does group_by() seem to have no effect when I follow it with mutate()?

Looking at mtcars:

library(dplyr)  # mtcars itself ships with base R's datasets package

Say I make a data.frame which is a summary of mtcars, grouped by "cyl" and "gear":

df1 <- mtcars %>%
    group_by(cyl, gear) %>%
    summarise(newvar = sum(wt))  # one row per (cyl, gear) pair

Then say I want to further summarise this data frame. With ddply, it'd be straightforward, but when I try to do this with dplyr, it's not actually "grouping by":

df2 <- df1 %>%
    group_by(cyl) %>%
    mutate(newvar2 = newvar + 5)  # mutate() keeps every input row

Still yields an ungrouped output:

  cyl gear newvar newvar2
1   6    3  6.675  11.675
2   4    4 19.025  24.025
3   6    4 12.375  17.375
4   6    5  2.770   7.770
5   4    3  2.465   7.465
6   8    3 49.249  54.249
7   4    5  3.653   8.653
8   8    5  6.740  11.740

Am I doing something wrong with the syntax?


Edit:

If I were to do this with plyr and ddply:

df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))

and then to get the second df:

df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)

But that same approach, with sum(newvar) + 5 in the summarise() function doesn't work with dplyr...

smci
Marc Tulla
  • Can you give us the equivalent `plyr` code with `ddply`, please? – dickoa Feb 08 '14 at 23:58
  • What do you mean by "ungrouped"? Were you expecting one row per group? Or were you expecting all rows from the same group to be listed together? – flodel Feb 09 '14 at 00:08
  • I'd expect just three rows for the second df (one for each cyl), as it looks with the ddply arguments that I just added in the edits... I assume this is just a matter of adding one argument somewhere that I'm missing? – Marc Tulla Feb 09 '14 at 00:09
  • Then I think you are confusing `mutate` and `summarise`. – flodel Feb 09 '14 at 00:10
  • Ah, so I am. Will summarise be as efficient as mutate if I want to summarise a dataframe while also adding new variables? – Marc Tulla Feb 09 '14 at 00:15

5 Answers

81

I had a similar problem. I found that simply detaching plyr solved it:

detach(package:plyr)    
library(dplyr)
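
To check whether masking is the cause before detaching anything, it may help to ask R which package a bare function name currently resolves to. The sketch below is only an illustration: `which_pkg` is a hypothetical helper written for this example, not a function from dplyr or plyr.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr or plyr): report the package
# namespace in which a function object was defined.
which_pkg <- function(f) environmentName(environment(f))

# With only dplyr attached, the bare names resolve to dplyr.
# If plyr had been attached afterwards, they would resolve to plyr instead.
which_pkg(summarise)
which_pkg(mutate)

# The attach order is visible on the search path; earlier entries win.
search()
```

If the helper reports plyr for `summarise` or `mutate`, the grouping is being ignored because plyr's versions are the ones being called.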
Ram Narasimhan
ManneR
  • Been sitting here pulling my hair out for the last hour and a half trying to understand why dplyr was simply ignoring my groupings. Glad to know I'm not just crazy. – Brandon Bertelsen Feb 22 '16 at 19:13
  • I couldn't figure out why code ran fine once using `summarize` but not upon visiting it later. Indeed, I'd added `plyr` after loading `dplyr`. This is why. Not sure if it's a recent addition, but I caught this recently when loading the two: `You have loaded plyr after dplyr - this is likely to cause problems. If you need functions from both plyr and dplyr, please load plyr first, then dplyr: library(plyr); library(dplyr)`. – Hendy Jun 15 '16 at 21:07
  • This often happens when `dplyr` functions are masked by another package. A general solution is to explicitly reference `dplyr`'s version of the function using `dplyr::summarise(...)`. – passerby51 Mar 19 '20 at 02:44
45

Taking Dickoa's answer one step further: as Hadley says, "summarise peels off a single layer of grouping". It peels off groupings in the reverse of the order in which you applied them, so you can just use:

mtcars %>%
 group_by(cyl, gear) %>%
 summarise(newvar = sum(wt)) %>%
 summarise(newvar2 = sum(newvar) + 5)

Note that this will give a different answer if you use group_by(gear, cyl) in the second line.

And to get your first attempt working:

df1 <- mtcars %>%
 group_by(cyl, gear) %>%
 summarise(newvar = sum(wt))

df2 <- df1 %>%
 group_by(cyl) %>%
 summarise(newvar2 = sum(newvar)+5)
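
One way to watch the "peeling" happen is to inspect the grouping variables after each step with `group_vars()`. This is a minimal sketch that assumes dplyr 1.0.0 or later, since it passes the `.groups` argument to spell out the default drop-last behaviour; the intermediate names `by_cyl_gear`, `step1`, and `step2` are made up for the illustration.

```r
library(dplyr)

by_cyl_gear <- mtcars %>% group_by(cyl, gear)
group_vars(by_cyl_gear)  # grouped by cyl, then gear

# The first summarise() collapses gear, so only cyl remains as a group.
# .groups = "drop_last" makes the default behaviour explicit (assumes dplyr >= 1.0.0).
step1 <- by_cyl_gear %>%
  summarise(newvar = sum(wt), .groups = "drop_last")
group_vars(step1)

# The second summarise() collapses cyl, leaving no grouping at all.
step2 <- step1 %>%
  summarise(newvar2 = sum(newvar) + 5)
group_vars(step2)
step2
```

Reversing the order to `group_by(gear, cyl)` would leave `gear` as the remaining group after the first step, which is why the final result would differ.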
Tim Cameron
  • I'd still like to get better information on Hadley's "peels off" metaphor. Does anyone have some references or other posted answers regarding it? – Michael Bellhouse Oct 23 '14 at 01:28
  • https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html, see section containing the phrase: "each summary peels off one level of the grouping" – Alex Nov 07 '16 at 22:56
11

If you translate your plyr code into dplyr using summarise instead of mutate you get the same results.

library(plyr)
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
df2
##   cyl newvar2
## 1   4  30.143
## 2   6  26.820
## 3   8  60.989

detach(package:plyr)    
library(dplyr)
mtcars %>%
    group_by(cyl, gear) %>%
    summarise(newvar = sum(wt)) %>%
    group_by(cyl) %>%
    summarise(newvar2 = sum(newvar) + 5)
##   cyl newvar2
## 1   4  30.143
## 2   8  60.989
## 3   6  26.820

EDIT

Since summarise drops the last group (gear), you can skip the second group_by (see @hadley's comment below):

library(dplyr)
mtcars %>%
    group_by(cyl, gear) %>%
    summarise(newvar = sum(wt)) %>%
    summarise(newvar2 = sum(newvar) + 5)
##   cyl newvar2
## 1   4  30.143
## 2   8  60.989
## 3   6  26.820
dickoa
  • So the second "group_by()" and "summarise()" calls overwrite the first ones? – Marc Tulla Feb 09 '14 at 01:02
  • Yes, and you can also use `regroup` to enforce that. – dickoa Feb 09 '14 at 05:53
  • You don't need the second `group_by()` here because summarise automatically drops the last group (the group it collapsed). – hadley Feb 09 '14 at 15:33
  • If you don't want to detach `plyr` for some reason, you can always just specify `dplyr::` in front of the `group_by` and `summarize` functions. – pyll Dec 19 '17 at 16:46
6

Detaching plyr is one way to solve the problem so you can use dplyr functions as desired... but what if you need other functions from plyr to complete other tasks in your code?

(In this example, I've got both dplyr and plyr libraries loaded)

Suppose we have a simple data.frame and we want to compute the groupwise sum of the variable value, grouped by the levels of gname:

dx <- data.frame(gname = c(1,1,1,2,2,2,3,3,3), value = c(2,2,2,4,4,4,5,6,7))
dx
  gname value
1     1     2
2     1     2
3     1     2
4     2     4
5     2     4
6     2     4
7     3     5
8     3     6
9     3     7

But when we try to use what we believe will produce a dplyr grouped sum, here's what happens:

dx %>% group_by(gname) %>% mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname

  gname value mysum
1     1     2    36
2     1     2    36
3     1     2    36
4     2     4    36
5     2     4    36
6     2     4    36
7     3     5    36
8     3     6    36
9     3     7    36

This doesn't give us the desired answer, most likely because plyr was attached after dplyr, so plyr's mutate() masks dplyr's version and ignores the grouping. We could detach plyr, but another way is to call the dplyr versions of group_by() and mutate() explicitly with the dplyr:: prefix:

dx %>% dplyr::group_by(gname) %>% dplyr::mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname

  gname value mysum
1     1     2     6
2     1     2     6
3     1     2     6
4     2     4    12
5     2     4    12
6     2     4    12
7     3     5    18
8     3     6    18
9     3     7    18

Now we see that this works as expected.

5

dplyr is working as you should expect in your example. mutate(), as you specified it, just adds 5 to each value of newvar as it creates newvar2, so the result looks the same whether or not you group. If, however, you specify something that differs by group, you will get a different result. For example:

df1 %>%
    group_by(cyl) %>%
    mutate(newvar2 = newvar + mean(cyl))  # mean(cyl) is computed per group
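
The difference between the two verbs shows up most clearly in the row counts. The following is a rough sketch under the assumption of dplyr 1.0.0 or later (for the `.groups` argument); `with_mutate` and `with_summarise` are names invented for this illustration.

```r
library(dplyr)

# Rebuild df1 from the question: one row per (cyl, gear) pair.
df1 <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt), .groups = "drop")

# mutate() computes the per-group sum but keeps every input row.
with_mutate <- df1 %>%
  group_by(cyl) %>%
  mutate(newvar2 = sum(newvar) + 5)

# summarise() computes the same value but collapses each group to one row.
with_summarise <- df1 %>%
  group_by(cyl) %>%
  summarise(newvar2 = sum(newvar) + 5)

nrow(with_mutate)     # same number of rows as df1
nrow(with_summarise)  # one row per cyl value
```

So when the goal is one row per `cyl`, as in the ddply version, `summarise()` is the verb to reach for; `mutate()` is for attaching a group-level value to each existing row.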
Vincent