Understanding the difference in results between dplyr group_by and tapply

Question

I expected these two runs to give the same results, but they are different. That makes me question whether I really understand how the dplyr code works (I have read pretty much everything I can find about dplyr in the package and online). Can anyone explain why the results differ, or how to obtain matching results?

library(dplyr)
x <- iris
x <- x %.%
    group_by(Species, Sepal.Width) %.%
    summarise (freq=n()) %.%
    summarise (mean_by_group = mean(Sepal.Width))  
print(x)

x <- iris
x <- tapply(x$Sepal.Width, x$Species, mean)
print(x)

Update: I don't think this is the most efficient way to do it, but the following code gives a result that matches the tapply approach. Per Hadley's suggestion, I scrutinized the results line by line, and this is the best I could come up with using dplyr:

library(dplyr)
x <- iris
x <- x %.%
    group_by(Species, Sepal.Width) %.%
    summarise (freq=n()) %.%
    mutate (mean_by_group = sum(Sepal.Width*freq)/sum(freq))
print(x)

Update: For some reason I thought I had to group by every variable I wanted to analyse, which is what sent things in the wrong direction. This is all I needed, and it is closer to the examples in the package.

x <- iris %.%
    group_by(Species) %.%
    summarise(Sepal.Width = mean(Sepal.Width))
print(x)

Hint: work through your dplyr code line-by-line. — hadley, May 27 '14 at 07:20

npjc · Accepted Answer · 2014-05-28T21:49:02.693

Maybe this...

- `dplyr`:

require(dplyr)

iris %>% group_by(Species) %>% summarise(mean_width = mean(Sepal.Width))

  # Source: local data frame [3 x 2]
  #
  #      Species        mean_width
  # 1     setosa             3.428
  # 2 versicolor             2.770
  # 3  virginica             2.974

- `tapply`:

tapply(iris$Sepal.Width, iris$Species, mean)

  # setosa versicolor  virginica 
  # 3.428      2.770      2.974
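As for why the two-step pipeline in the question gave different numbers: grouping by both `Species` and `Sepal.Width` leaves one row per distinct width within each species, so the second `summarise()` averages the distinct widths without regard to how often each occurs. The following is a rough base-R sketch of that effect, not the dplyr internals; the names `counts`, `distinct_means`, `weighted_means`, and `row_means` are illustrative only.

```r
# One row per distinct (Species, Sepal.Width) pair, with its frequency.
# This mimics group_by(Species, Sepal.Width) followed by summarise(freq = n()).
counts <- aggregate(freq ~ Species + Sepal.Width,
                    data = transform(iris, freq = 1L),
                    FUN = sum)

# Plain mean of the distinct widths per species (ignores the frequencies).
distinct_means <- tapply(counts$Sepal.Width, counts$Species, mean)

# Frequency-weighted mean of the distinct widths per species.
weighted_means <- with(counts,
                       tapply(Sepal.Width * freq, Species, sum) /
                         tapply(freq, Species, sum))

# Mean over all original rows per species.
row_means <- tapply(iris$Sepal.Width, iris$Species, mean)

print(distinct_means)
print(weighted_means)
print(row_means)
```

Only the weighted version should agree with the row-level means, which is consistent with the `sum(Sepal.Width*freq)/sum(freq)` fix in the question's first update.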

NOTE: `tapply()` simplifies output by default whereas `summarise()` does not:

typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=TRUE))

  # [1] "double"

it returns a list otherwise:

typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=FALSE))

  # [1] "list"

So to get the same type of output from tapply() as from summarise(), you would need:

tbl_df( 
  data.frame( 
    mean_width = tapply( iris$Sepal.Width, 
                         iris$Species, 
                         mean )))

  # Source: local data frame [3 x 1]
  #
  #            mean_width
  # setosa          3.428
  # versicolor      2.770
  # virginica       2.974

And this still isn't the same, because the species labels (unique(iris$Species)) are stored as an attribute (the row names) here and not as a column of the data frame.
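If a real `Species` column is wanted from the `tapply()` result, one possible approach is to pull the group labels out of the names and build the data frame by hand. This is a minimal base-R sketch; `means`, `by_species`, and `agg` are illustrative names, and `aggregate()` is shown only as an alternative that returns a similar shape directly.

```r
means <- tapply(iris$Sepal.Width, iris$Species, mean)

# Move the group labels out of the names attribute into an ordinary column.
by_species <- data.frame(Species = names(means),
                         mean_width = as.vector(means),
                         row.names = NULL)
print(by_species)

# aggregate() returns one row per group with the grouping variable as a column.
agg <- aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)
print(agg)
```

Either result has one row per species and two columns, which is the same shape as the `summarise()` output, although the value column in `agg` keeps the name `Sepal.Width`.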

This worked and I marked it as the answer, but you might want to add that one needs to load the magrittr package to enable %>%. I was not familiar with that package before. — Michael Bellhouse, May 28 '14 at 21:38
@MichaelBellhouse edited post to include `require(dplyr)` line. dplyr imports a small part of the `magrittr` package upon loading. thanks for the heads up. — npjc, May 28 '14 at 21:50
actually your code would not run for me without magrittr loaded: — Michael Bellhouse, May 29 '14 at 00:20
