I was very shocked by the smoothness of dplyr package in flow-style data processing. Recently I ran into a problem: I need to generate a new data frame for each group ID and combine those small data frames into one final, larger data frame. A toy example:

input.data.frame %>%
    group_by(gid) %>%
    {some operation to generate a new data frame for each group} ## FAILED!!!!

In dplyr, mutate adds new columns to each group and summarise generates a summary for each group; neither fulfills my requirement. (Did I miss something?)

Alternatively, using ddply from the plyr package, the predecessor of dplyr, I can do it via

library(plyr)
ddply(input.data.frame, .(gid), function(x) {
    # some operation to generate a new data frame for each group
})

The drawback is that some dplyr functions are masked when I load the plyr package.

Cœur
caesar0301
  • You have to use the `do` operator in such a case. However, it would be better if you showed us what you really want to do / achieve in the end. From the help file: "You can use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list." – talat Nov 07 '14 at 08:14
  • Nice introduction "I was very shocked by the smoothness of dplyr package in flow-style data processing." :) – talat Nov 07 '14 at 08:19
  • And by the way, if you load both packages (plyr and dplyr) the recommendation is to load plyr first and then dplyr, so the "standard" package for e.g. "summarise" would be dplyr, but if you need it from plyr, just use `plyr::summarise` to make the package explicit. – talat Nov 07 '14 at 08:25
  • `do`, that is what I am looking for!! A really general operation. Thanks guy. :) – caesar0301 Nov 07 '14 at 08:34

2 Answers

Here is an example following the answer by G. Grothendieck to a similar question, "Adding rows in `dplyr` output".

First we generate a data frame with columns x and g: x holds 9 random numbers and g assigns them to 3 groups a, b, c. We want to select the 2 largest numbers from each group. It is important to remember that do requires the expression to return a data frame when it is passed as an unnamed argument.

library(dplyr)
set.seed(1)
dat <- data.frame(x=runif(9),g=rep(letters[1:3],each=3))

dat
      x g
1 0.1765568 a
2 0.6870228 a
3 0.3841037 a
4 0.7698414 b
5 0.4976992 b
6 0.7176185 b
7 0.9919061 c
8 0.3800352 c
9 0.7774452 c

## this works
dat %>% dplyr::group_by( g ) %>% do( data.frame(x=tail(sort(.$x),2)) )

## this works too
dat %>% dplyr::group_by( g ) %>% do( .[tail(order(.$x),2),] )

          x      g
      (dbl) (fctr)
1 0.3841037      a
2 0.6870228      a
3 0.7176185      b
4 0.7698414      b
5 0.7774452      c
6 0.9919061      c

## no error, but with a named argument the result is stored in a list column x (one numeric vector per group)
dat %>% dplyr::group_by( g ) %>% do( x=tail(sort(.$x),2) )
       g        x
  (fctr)    (chr)
1      a <dbl[2]>
2      b <dbl[2]>
3      c <dbl[2]>

## for more complicated operations, wrap the logic in a function
top2x <- function(df) { df[tail(order(df$x),2),] }
dat %>% dplyr::group_by( g ) %>% do( top2x(.) )
YH Wu

Turning my comment into an answer.

Yes, dplyr offers a way to create data.frames for each group: the do operator. Used on a grouped data.frame / tbl, it lets you apply arbitrary functions to each group. This is documented in the help file for do:

[...] You can use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list. This is particularly useful when working with models: you can fit models per group with do and then flexibly extract components with either another do or summarise.
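The model-fitting workflow mentioned in that passage might look roughly like the sketch below, written in the spirit of the examples in `?do`. The choice of `mtcars`, the formula `mpg ~ disp`, and the names `models` and `rsq` are assumptions for illustration only.

```r
library(dplyr)

# Fit one linear model per cylinder group; the named argument `mod`
# stores each fitted model in a list column
models <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ disp, data = .))

# Extract one component per fitted model with summarise()
rsq <- models %>%
  summarise(rsq = summary(mod)$r.squared)

rsq
```

Because `mod` is a named argument, each group's result is kept as an arbitrary object rather than being combined into one data frame.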

My experience so far is that whenever one of the specialised dplyr functions such as mutate / summarise / mutate_each can do the job, it should be preferred over do: these functions are often more efficient, though of course not as flexible.
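For the case in the question (a new data frame per group, combined into one result), a minimal sketch could look like this. The toy data frame `dat`, its columns `gid` and `value`, and the helper `make_group_df` are all hypothetical and not taken from the original question.

```r
library(dplyr)

# Toy input (an assumption for illustration): 3 groups, 4 values each
dat <- data.frame(gid = rep(c("a", "b", "c"), each = 4),
                  value = 1:12)

# Hypothetical helper: builds a new two-row data frame for one group
make_group_df <- function(df) {
  data.frame(stat = c("min", "max"),
             value = c(min(df$value), max(df$value)))
}

# The unnamed expression must return a data frame; `.` is the current group.
# do() row-binds the per-group results and adds the grouping column.
result <- dat %>%
  group_by(gid) %>%
  do(make_group_df(.))

result
```

Here each group contributes two rows, so the combined result has six rows; a helper that returns a different number of rows per group would work the same way.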

talat
  • Could you please provide an answer to your question that would show a code snippet on how to use do to do this? Thanks – MartinT Nov 17 '15 at 12:31
  • @user2731872, just take a look at the examples section provided in the help page of `?do`. Or provide a minimal example of your problem, but then it would be better if you ask a new question – talat Nov 17 '15 at 13:57
  • Thanks - I did and I am none the wiser. The result of the examples shown result in a grouped_df, not in a list of data frames, which is what the original question was here, I thought: `by_cyl <- group_by(mtcars, cyl); do(by_cyl, head(., 2))` results in a grouped df. I want a list of dfs. I asked the question here now: [link](http://stackoverflow.com/questions/33775239/emulate-split-with-dplyr-group-by-return-a-list-of-data-frames) – MartinT Nov 18 '15 at 08:42
  • @user2731872, dplyr is designed to work with tabular data like `data.frame`s, `data.table`s, `tbl_df`s etc, not for lists. The point is that because of grouping functionality in dplyr, it's usually not necessary to do an explicit `split` as might be necessary when using only base R. – talat Nov 18 '15 at 09:42
  • Without an actual code snippet it is very difficult to understand how to accomplish what you are saying. – CaffeineConnoisseur Mar 17 '17 at 18:20
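
For readers who, like the commenter above, want an actual list of data frames rather than one combined result, a small sketch follows. It uses base R's `split` and, as an alternative, `group_split`, which is assumed to be available (dplyr 0.8.0 or later); the variable names are illustrative.

```r
library(dplyr)

# Base R: a named list with one data frame per value of cyl
dfs_base <- split(mtcars, mtcars$cyl)

# dplyr 0.8.0 or later: an unnamed list of tibbles, one per group
dfs_dplyr <- mtcars %>% group_split(cyl)

length(dfs_base)
length(dfs_dplyr)
```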