Can dplyr perform chained summarise operations on a data.frame?

My data.frame has the structure:

data_df = tbl_df(data)    
data_df %.%
        group_by(col_1) %.%
        summarise(number_of= length(col_2)) %.%
        summarise(sum_of = sum(col_3)) 

This causes RStudio to crash with a fatal error and the message "R Session Aborted".

Usually with plyr I would include these summarise functions without problems.

UPDATE

Data are here.

Code is:

library(dplyr)

orth <- read.csv('orth0106.csv')
orth_df = tbl_df(orth)


orth_df %.%
    group_by(Hospital) %.%
    summarise(Procs = length(Procedure)) %.%
    summarise(SSIs = sum(SSI))
John
  • Could you provide a reproducible example, to reproduce the error? – marbel Jan 25 '14 at 07:55
  • @martin-bel - data and code now included. – John Jan 25 '14 at 08:06
  • In the future, please file bugs like this directly at github. I've voted to close this issue since it will no longer apply once the next version of dplyr comes out (which will be soon) – hadley Jan 27 '14 at 14:16
  • This question appears to be off-topic because it is a bug report which has been fixed in the development version of the software. – hadley Jan 27 '14 at 14:16

1 Answer

I can reproduce the error on a Windows 7 machine running RStudio 0.97.551.

It may be because the second summarise is chained onto a result that no longer contains the column it needs. You can compute both summaries in a single summarise call, as I've done here:

url <- "https://raw.github.com/johnmarquess/some.data/master/orth0106.csv"

library(dplyr)

orth <- read.csv(url)
orth_df <- tbl_df(orth)


orth_df %.%
    group_by(Hospital) %.%
    summarise(Procs = length(Procedure), SSIs = sum(SSI))

## Source: local data frame [18 x 3]
## 
##    Hospital Procs SSIs
## 1         A   865   80
## 2         B  1069   38
## 3         C   796   24
## 4         D   891   35
## 5         E   997   39
## 6         F   550   30
## 7         G  2598  128
## 8         H   373   27
## 9         I  1079   70
## 10        J   714   30
## 11        K   477   30
## 12        L   227    2
## 13        M   125    6
## 14        N   589   38
## 15        O   292    3
## 16        P   149    9
## 17        Q  1984   52
## 18        R   351   13
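
For readers on a current dplyr release, here is a minimal sketch of the same single-call summary. It assumes the `%>%` pipe, which later dplyr releases use in place of `%.%`, and `n()` in place of `length(Procedure)`. The `toy` data frame is made up for illustration and is not the orth0106.csv data.

```r
library(dplyr)

# Hypothetical stand-in for the real data: one row per procedure,
# SSI is 1 when an infection was recorded and 0 otherwise.
toy <- data.frame(
  Hospital  = c("A", "A", "A", "B", "B"),
  Procedure = c("hip", "knee", "hip", "hip", "knee"),
  SSI       = c(1, 0, 0, 0, 1)
)

# Both summaries are computed in one summarise() call, so each
# expression still sees the original columns of every group.
result <- toy %>%
  group_by(Hospital) %>%
  summarise(Procs = n(), SSIs = sum(SSI))

print(result)
```

Keeping both expressions in one call avoids depending on what the first summary leaves behind.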

In any event this seems like either an RStudio or a dplyr bug. I'd open up an issue with Hadley as he probably cares either way. https://github.com/hadley/dplyr/issues

EDIT: Your first call also causes Rgui (Windows) and R in the terminal to crash on:

R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)

This indicates a dplyr problem Hadley and Romain will want to know about.

To illustrate my first point, run only the first summarise:

orth_df %.%
    group_by(Hospital) %.%
    summarise(Procs = length(Procedure))

Source: local data frame [18 x 2]

   Hospital Procs
1         A   865
2         B  1069
3         C   796
4         D   891
5         E   997
6         F   550
7         G  2598
8         H   373
9         I  1079
10        J   714
11        K   477
12        L   227
13        M   125
14        N   589
15        O   292
16        P   149
17        Q  1984
18        R   351

Where is %.% summarise(SSIs = sum(SSI)) supposed to find SSI?
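
One way to check this is to look at the column names of the intermediate result. The sketch below uses a made-up `toy` data frame and the `%>%` pipe from later dplyr releases rather than `%.%`, so it illustrates the reasoning and does not reproduce the crash. With a current dplyr release, the second summarise should fail with an ordinary R error, which `try()` captures here.

```r
library(dplyr)

# Hypothetical data, not the orth0106.csv file.
toy <- data.frame(
  Hospital = c("A", "A", "B"),
  SSI      = c(1, 0, 1)
)

# Result of the first summarise(): one row per hospital.
step1 <- toy %>%
  group_by(Hospital) %>%
  summarise(Procs = n())

# Only the grouping column and the new summary column remain.
print(names(step1))

# SSI is no longer among the columns ...
has_ssi <- "SSI" %in% names(step1)

# ... so a second summarise() that refers to it cannot succeed.
attempt <- try(summarise(step1, SSIs = sum(SSI)), silent = TRUE)
failed  <- inherits(attempt, "try-error")

print(c(has_ssi = has_ssi, failed = failed))
```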

So the chaining you think is happening fails. To my understanding, %.% is similar to how ggplot2 works but not exactly the same. In ggplot2, once you pass the data in the initial mapping you can access it later on. Here %.% seems to grab the chunk on its left and operate on that result alone.

So you're grabbing:

   Hospital Procs
1         A   865
2         B  1069
3         C   796
.
.
.
17        Q  1984
18        R   351

When you then apply %.% summarise(SSIs = sum(SSI)), there is no SSI column left to use. The analogy that comes to mind is wiring Christmas lights in series versus in parallel: %.% is serial, ggplot() + is parallel. This is a nonprogrammer's understanding of things, and the R gurus may come and tell me I'm stupid, but for now that's the best theory you've got.
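
The "serial" picture can be sketched in code: each step receives only what the previous step returned, so a chain behaves like nested function calls. This again assumes the `%>%` pipe from later dplyr releases and a made-up `toy` data frame, so treat it as an illustration, not a statement about how `%.%` was implemented.

```r
library(dplyr)

# Hypothetical data, not the orth0106.csv file.
toy <- data.frame(
  Hospital = c("A", "A", "B"),
  SSI      = c(1, 0, 1)
)

# Chained form: the output of each step is the input of the next.
piped <- toy %>%
  group_by(Hospital) %>%
  summarise(Procs = n(), SSIs = sum(SSI))

# Nested form: summarise() only ever sees what group_by() returned.
nested <- summarise(group_by(toy, Hospital), Procs = n(), SSIs = sum(SSI))

# Compare the two results as plain data frames.
same <- identical(as.data.frame(piped), as.data.frame(nested))
print(same)
```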

Tyler Rinker