5

I would like to use dplyr's split-apply-combine strategy to apply the summary() command to each group.

Take a simple data frame:

df <- data.frame(class = c('A', 'A', 'B', 'B'),
                 value = c(100, 120, 800, 880))

Ideally we would do something like this:

df %>%
  group_by(class) %>%
  do(summary(.$value))

Unfortunately this does not work. Any ideas?

David Arenburg
Bastiaan Quast

3 Answers

5

You can use the standard-evaluation (SE) version of data_frame, that is, data_frame_, and run:

df %>%
  group_by(class) %>%
  do(data_frame_(summary(.$value)))

Alternatively, you can use as.list() wrapped by data.frame() with the argument check.names = FALSE:

df %>%
  group_by(class) %>%
  do(data.frame(as.list(summary(.$value)), check.names = FALSE))

Both versions produce:

# Source: local data frame [2 x 7]
# Groups: class [2]
# 
#    class  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
#   (fctr) (dbl)   (dbl)  (dbl) (dbl)   (dbl) (dbl)
# 1      A   100     105    110   110     115   120
# 2      B   800     820    840   840     860   880
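
The same split-apply-combine idea can be sketched without dplyr, which may help show what the as.list() conversion contributes. This is only a base-R illustration under that assumption, not the method used in this answer, and the helper name `summarise_group` is made up for the example.

```r
# Same example data as in the question
df <- data.frame(class = c("A", "A", "B", "B"),
                 value = c(100, 120, 800, 880))

# Hypothetical helper: turn one group's summary() into a one-row data.frame.
# check.names = FALSE keeps names such as "1st Qu." unchanged.
summarise_group <- function(v) {
  data.frame(as.list(summary(v)), check.names = FALSE)
}

# Split the values by class, summarise each piece, then bind the rows
pieces <- lapply(split(df$value, df$class), summarise_group)
result <- cbind(class = names(pieces), do.call(rbind, pieces))
rownames(result) <- NULL

print(result)
```

Each piece is a one-row data.frame, so binding them gives one row per class, which is the shape that the unnamed form of do() expects from each group.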
JasonAizkalns
  • Thanks, this output looks perfect. I've read about SE a bit but never quite understood it. Which package do these functions belong to? From the use of `_` it looks like one of @hadley. I also found a way to use `tidy()` from the broom package. See below. – Bastiaan Quast Mar 28 '16 at 14:31
  • `data_frame` and `data_frame_` come from `dplyr`. This answer deserves the checkmark as far as I'm concerned, by the way. – Axeman Mar 28 '16 at 14:32
  • Thanks, that makes sense. And thank you for volunteering that, I changed it. – Bastiaan Quast Mar 28 '16 at 15:29
4

The problem is that dplyr's do(), when given an unnamed argument, only works if the expression returns a data.frame.

The broom package's tidy() function can be used to convert the output of summary() to a data.frame.

df %>%
  group_by(class) %>%
  do( tidy(summary(.$value)) )

This gives:

Source: local data frame [2 x 7]
Groups: class [2]

   class minimum    q1 median  mean    q3 maximum
  (fctr)   (dbl) (dbl)  (dbl) (dbl) (dbl)   (dbl)
1      A     100   105    110   110   115     120
2      B     800   820    840   840   860     880
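
To make the renaming in that output concrete, here is a rough base-R approximation. It is not broom's implementation: the helper `tidy_summary_sketch` is hypothetical, the column names are simply copied from the output above, and it assumes a numeric vector without missing values (so that summary() returns exactly six statistics).

```r
# Hypothetical stand-in for the renaming seen in the tidy() output above
tidy_summary_sketch <- function(v) {
  s <- summary(v)
  # Assumption: no NAs, so there are exactly six summary statistics
  stopifnot(length(s) == 6)
  new_names <- c("minimum", "q1", "median", "mean", "q3", "maximum")
  # Drop the summary classes, rename, and build a one-row data.frame
  data.frame(as.list(setNames(as.numeric(s), new_names)))
}

sketch_a <- tidy_summary_sketch(c(100, 120))
print(sketch_a)
```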
Bastiaan Quast
3

The behavior of do() changes depending on whether you give it a named or an unnamed argument. With an unnamed argument, it expects a data.frame for each group, and these are bound together. With a named argument, it makes a row for each group and puts whatever the output is into a new variable with that name.

So in this case it will complain about the unnamed use (summary() does not produce a data.frame), but the named use will work:

df %>%
  group_by(class) %>%
  do(summaries = summary(.$value)) ->
  df2

Which gives:

Source: local data frame [2 x 2]
Groups: <by row>

   class                  summaries
  (fctr)                      (chr)
1      A <S3:summaryDefault, table>
2      B <S3:summaryDefault, table>

We can access a summary like this:

df2$summaries[[1]]

Giving:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
100     105     110     110     115     120 

Getting all of these as new columns for df can only be done by first converting the output to a data.frame, as can be seen in the other answers.

So the root of the problem here is that summary outputs a table instead of a data.frame.
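
A quick base-R check can illustrate this point. It is a small sketch using the class A values from the question and one possible conversion, nothing more.

```r
# Summary of the class A values from the question
s <- summary(c(100, 120))

class(s)          # the S3 classes of the summary object
is.data.frame(s)  # FALSE, which is why the unnamed form of do() rejects it

# One possible conversion to a one-row data.frame
s_df <- data.frame(as.list(s), check.names = FALSE)
is.data.frame(s_df)
```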

Axeman
  • Thanks, this is great. Another way I just came across would be to use the `tidy()` function from the broom package. But naming it is a very simple way to avoid that. – Bastiaan Quast Mar 28 '16 at 14:07