I'm interested in finding an efficient manner to obtain a summary by group table that would contain:
- Count for unique values per group
- A primitive set of descriptive statistics for selected variables
For example, in case of generating the descriptive statistics I'm using the code below:
data("mtcars")
require(dplyr)
mt_sum <- mtcars %>%
group_by(cyl) %>%
summarise_each(funs(min,max), hp, wt, disp)
which would generate the desired output:
> head(mt_sum)
Source: local data frame [3 x 7]
cyl hp_min wt_min disp_min hp_max wt_max disp_max
(dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 4 52 1.513 71.1 113 3.190 146.7
2 6 105 2.620 145.0 175 3.460 258.0
3 8 150 3.170 275.8 335 5.424 472.0
I'm interested in enriching the data with figure that would reflect count of values per each group. With respect to the count, this can be simply done:
mt_sum2 <- mtcars %>%
group_by(cyl) %>%
summarise(countObs = n())
which would generate the required data:
> head(mt_sum2)
Source: local data frame [3 x 2]
cyl countObs
(dbl) (int)
1 4 11
2 6 7
3 8 14
Problem
The problem occurs when I would like to simultaneously apply both transformations.
Attempt 1
For example the code:
mt_sum <- mtcars %>%
group_by(cyl) %>%
summarise_each(funs(min,max), hp, wt, disp) %>%
summarise(countObs = n())
would generate:
Source: local data frame [3 x 2]
cyl countObs
(dbl) (int)
1 4 11
2 6 7
3 8 14
without the descriptive statistics that were previously generated.
Attempt 2
The code:
mt_sum <- mtcars %>%
group_by(cyl) %>%
summarise_each(funs(min,max,n), hp, wt, disp)
will expectedly fail:
Error: n does not take arguments
Attempt 3 (working)
The code:
data("mtcars")
require(dplyr)
mt_sum <- mtcars %>%
group_by(cyl) %>%
summarise_each(funs(min,max), hp, wt, disp) %>%
left_join(y = data.frame(
"Var1" = as.numeric(as.character(as.data.frame(table(mtcars$cyl))$Var1)),
"Count" = as.character(as.data.frame(table(mtcars$cyl))$Freq)),
by = c("cyl" = "Var1"))
will deliver the required data:
> head(mt_sum)
Source: local data frame [3 x 8]
cyl hp_min wt_min disp_min hp_max wt_max disp_max Count
(dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (fctr)
1 4 52 1.513 71.1 113 3.190 146.7 11
2 6 105 2.620 145.0 175 3.460 258.0 7
3 8 150 3.170 275.8 335 5.424 472.0 14
I think this is extremely inefficient way of producing this summary. In particular, creating objects on the fly is inefficient when working with big tables. I'm interested in achieving the same results but in a more efficient manner that would not involve creating objects just for the purpose of merging. In particular, what I would like to do in dplyr
would correspond to deriving additional summaries from the previous version of the table. For example:
- Group
- Produce descriptive statistics
- Come back to the data after group
- Produce some additional statistics and add to the final data