5

I'm interested in finding an efficient manner to obtain a summary by group table that would contain:

  • Count for unique values per group
  • A primitive set of descriptive statistics for selected variables

For example, in case of generating the descriptive statistics I'm using the code below:

data("mtcars")
require(dplyr)
mt_sum <- mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min,max), hp, wt, disp)

which would generate the desired output:

> head(mt_sum)

Source: local data frame [3 x 7]

    cyl hp_min wt_min disp_min hp_max wt_max disp_max
  (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl)
1     4     52  1.513     71.1    113  3.190    146.7
2     6    105  2.620    145.0    175  3.460    258.0
3     8    150  3.170    275.8    335  5.424    472.0

I'm interested in enriching the data with figure that would reflect count of values per each group. With respect to the count, this can be simply done:

mt_sum2 <- mtcars %>% 
  group_by(cyl) %>% 
  summarise(countObs = n())

which would generate the required data:

> head(mt_sum2)
Source: local data frame [3 x 2]

    cyl countObs
  (dbl)    (int)
1     4       11
2     6        7
3     8       14 

Problem

The problem occurs when I would like to simultaneously apply both transformations.

Attempt 1

For example the code:

mt_sum <- mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min,max), hp, wt, disp) %>% 
  summarise(countObs = n())

would generate:

Source: local data frame [3 x 2]

    cyl countObs
  (dbl)    (int)
1     4       11
2     6        7
3     8       14

without the descriptive statistics that were previously generated.

Attempt 2

The code:

mt_sum <- mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min,max,n), hp, wt, disp)

will expectedly fail:

Error: n does not take arguments

Attempt 3 (working)

The code:

data("mtcars")
require(dplyr)
mt_sum <- mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min,max), hp, wt, disp) %>% 
  left_join(y = data.frame(
    "Var1" = as.numeric(as.character(as.data.frame(table(mtcars$cyl))$Var1)),
    "Count" = as.character(as.data.frame(table(mtcars$cyl))$Freq)),
            by = c("cyl" = "Var1"))

will deliver the required data:

> head(mt_sum)
Source: local data frame [3 x 8]

    cyl hp_min wt_min disp_min hp_max wt_max disp_max  Count
  (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl) (fctr)
1     4     52  1.513     71.1    113  3.190    146.7     11
2     6    105  2.620    145.0    175  3.460    258.0      7
3     8    150  3.170    275.8    335  5.424    472.0     14

I think this is extremely inefficient way of producing this summary. In particular, creating objects on the fly is inefficient when working with big tables. I'm interested in achieving the same results but in a more efficient manner that would not involve creating objects just for the purpose of merging. In particular, what I would like to do in dplyr would correspond to deriving additional summaries from the previous version of the table. For example:

  1. Group
  2. Produce descriptive statistics
  3. Come back to the data after group
  4. Produce some additional statistics and add to the final data
Konrad
  • 17,740
  • 16
  • 106
  • 167
  • 2
    `summarise_each(funs(min,max,length), hp, wt, disp)` ? – jeremycg Dec 07 '15 at 12:51
  • @jeremycg Thanks for showing the interest. I was thinking of something on those lines `hp_length wt_length disp_length`, I would have to remove those columns via `select`. I don't find the `select` syntax easiest to mange, it would have to correspond to *keep only one column with `length` and drop rest but keep all previous columns for other summaries* or something like that. – Konrad Dec 07 '15 at 12:56
  • 2
    you could use `select(-grep("length", names(.))[-1])`. The timing is about the same as the left_join answer below - it will depend on the size and scale of your data – jeremycg Dec 07 '15 at 13:09
  • @jeremycg Thanks very much for this, it's a neat solution. – Konrad Dec 07 '15 at 13:10

1 Answers1

3

Here's another (shorter) option using a left_join:

mtcars %>% 
    group_by(cyl) %>%  
    summarise_each(funs(min,max), hp, wt, disp) %>% 
    left_join(count(mtcars, cyl))
#Joining by: "cyl"
#Source: local data frame [3 x 8]
#
#    cyl hp_min wt_min disp_min hp_max wt_max disp_max     n
#  (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl) (int)
#1     4     52  1.513     71.1    113  3.190    146.7    11
#2     6    105  2.620    145.0    175  3.460    258.0     7
#3     8    150  3.170    275.8    335  5.424    472.0    14
talat
  • 68,970
  • 21
  • 126
  • 157