What is happening in the first snippet below, and why does its result differ from the next two?

library(tidyverse)
library(magrittr)

data.frame(A=c(2,2),B=c(1,1)) %>%
   summarise(A = sum(A),B = sum(B), D=sum(A)-sum(B))

yields D=0

data.frame(A=c(2,2),B=c(1,1)) %>%
   summarise(A = sum(A),B = sum(B), D=sum(A-B) )

yields D=2

data.frame(A=c(2,2),B=c(1,1)) %>% 
  summarise(sum_A = sum(A),sum_B = sum(B), D=sum(A)-sum(B))

yields D=2.

I can't come up with an explanation for how D=0 can be the result of such an operation. How can D=0 be a sensible result?

zx8754
Martin
  • Interesting (+1); also take a look at `data.frame(A=c(2,2),B=c(1,1)) %>% summarise(A = sum(A), B = sum(B), D=sum(A), E = sum(B))` – talat Nov 30 '17 at 10:47
  • `mutate` seems to work fine. `data.frame(A=c(2,2),B=c(1,1)) %>% mutate(A = sum(A), B = sum(B), D=sum(A))` – Sotos Nov 30 '17 at 10:52
  • From the definition of `summarise`, *summarise() is typically used on grouped data created by group_by(). The output will have one row for each group.* So maybe it happens because the data frame is not grouped? Whereas `mutate`, which does not need groups, works as expected – Sotos Nov 30 '17 at 10:54
  • @Sotos, it should normally work just fine with ungrouped data, too. It looks like a bug to me – talat Nov 30 '17 at 10:54
  • From `?summarise` - "*# A summary applied to ungrouped tbl returns a single row*" - it should work as documented behaviour. The issue also persists using a `tibble` instead of a `data.frame` – thelatemail Nov 30 '17 at 10:55
  • @docendodiscimus Definitely. I'm not saying it should work like that. – Sotos Nov 30 '17 at 10:57
  • The smallest example I can do that breaks it - `df %>% summarise(A=sum(A),B=sum(A))` – thelatemail Nov 30 '17 at 11:00
  • Looks like a bug. Caused by applying a function to a (single) variable already updated within `summarise`. Interestingly seems to work when you combine variables within a function (?) : `data.frame(A=c(2,2),B=c(1,1)) %>% summarise(A = sum(A), B = sum(B), C = A, D = sum(A), E = mean(A), G = sum(A-B), H = mean(A-B))` – AntoniosK Nov 30 '17 at 11:02
  • This is a bug, I have filed an issue at https://github.com/tidyverse/dplyr/issues/3233 – Lionel Henry Nov 30 '17 at 11:13
  • Thought so too. Thanks for the quick replies. – Martin Nov 30 '17 at 11:30
  • I recommend using your last approach of renaming things. Another option is to use summarise (to compute sum of A and B) and then mutate (to compute the difference) – Felipe Gerard Feb 07 '18 at 18:52
  • The order is also important... summarise(D=sum(A)-sum(B),A = sum(A),B = sum(B)) yields 2 4 2 – mysteRious Mar 21 '18 at 03:44

1 Answer

This is a bug; see https://github.com/tidyverse/dplyr/issues/3233. It is fixed in dplyr 0.7.4.9001.
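
On versions where the bug is still present, the two workarounds suggested in the comments (giving the summaries names that differ from the input columns, or computing the difference in a separate `mutate` step) might look like the sketch below. It assumes only that `dplyr` is installed; the names `df`, `res1` and `res2` are illustrative and not part of the original question.

```r
library(dplyr)

df <- data.frame(A = c(2, 2), B = c(1, 1))

# Workaround 1: use output names that do not shadow the input columns,
# then build D from the already-summarised values.
res1 <- df %>%
  summarise(sum_A = sum(A), sum_B = sum(B), D = sum_A - sum_B)

# Workaround 2: summarise first, then compute the difference with mutate,
# so D is derived from the one-row summary rather than inside summarise.
res2 <- df %>%
  summarise(A = sum(A), B = sum(B)) %>%
  mutate(D = A - B)

res1$D  # 2
res2$D  # 2
```

Both variants avoid applying a function to a column that was overwritten earlier in the same `summarise` call, which is the pattern the comments identify as triggering the wrong result.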

trinalbadger587