Summary statistics grouped by multiple columns on specific columns corresponding to them

Question

I have a data.frame of the size 75 million x 36, { 75 million rows) , where the cols are

col1, col1_decile, col2, col2_decile ........... col18 , col18_decile

Now I want to get summary statistics ( min, max, mean and standard deviation) corresponding for each of the columns col1,col2 ....... col18 grouped by their deciles.

i.e. summary statistics of

col1 by col1_decile, of col2 by col2_decile , of col3 by col3_decile ......, of col18 by col18_decile

For a reproducible example , I will do with the mtcars dataset:

library(dplyr)
data("mtcars")
mtcars %>% mutate_all(funs(decile = ntile(., 10))) -> mtcars_deciled

head(mtcars_deciled)

Here the columns are

mpg,cyl, disp, hp, drat,wt,qsec, vs, am, gear, carb,mpg_decile, cyl_decile, disp_decile, hp_decile, drat_decile,wt_decile qsec_decile, vs_decile, am_decile, gear_decile,carb_decile

I want the final data.frame to look like

decile mpg_decile_min mpg_decile_max mpg_decile_mean mpg_decile_sd ...

and so on for all the columns.

Each min, max, mean , std. deviation will be calculated based on the corresponding decile column

since it is a huge dataset of 75 million rows, I'm looking for fast solutions. I have tinkered with seplyr in R! , but didn't get far.

Fast solutions with data.table or dplyr or seplyr would be appreciated. The final data.frame should have 10 rows and 73 columns ( 4 summary statistic columns for min,max , mean and sd for each deciled columns ( 18 decile columns) and the common decile group column

decile mpg_decile_min mpg_decile_max mpg_decile_mean mpg_decile_sd .... carb_decile_min carb_decile_max carb_decile_mean carb_decile_d

I'm only seeing 11 decile columns, where are the other 7 to make the 18 cited? — www, Sep 05 '17 at 18:13
You should just phrase the whole thing in terms of the provided example, maybe noting at the end that your real data frame is similar but has dimensions x by y (since your current approach can lead to confusion). Also, of course, you should provide the expected output for the given example, see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 — Frank, Sep 05 '17 at 18:18
@RyanRunge, my real dataset has 18 columns , for the example , I gave with the mtcars dataset, its only 11*2 = 22 columns. — ML_Passion, Sep 05 '17 at 18:26

Gilles San Martin · Answer 1 · 2017-09-05T23:51:12.447

Here is one possibility only with data.table.

The problem is the structure of the dataset with a mix of variables types on the same line (decile and measures). You have to reorganise it to make the aggregation easier.

The following example might be slow on big dataset (grepl, gsub, ifelse,... ??) and can probably be optimized.There are also multiple copies of the whole dataset. Maybe piping each command into the next wone might be better ? Advices welcome...

library(data.table)
library(dplyr)
data("mtcars")

# Your example in data.table format
DT <- as.data.table(mtcars %>% mutate_all(funs(decile = ntile(., 10))))

# Add an ID for each row
DT[,ID := 1:nrow(DT)]

# Transform the dataset in "long" format
tmp <- melt(DT, id.vars = "ID")

# Create a variable to make the distinction between the decile values and the 
# measurements. Maybe not optimal for speed ?
tmp[, decile := ifelse(grepl("_decile$", variable), "decile", "value")]

# Remove the "_decile" suffix
tmp[, variable := gsub("_decile$", "", variable)]

# Cross table to have for each observation, the type of variable, the decile and the value
tmp <- dcast(tmp, ID + variable ~ decile)

# Now it is quite straightforward to compute your summary statistics with data.table syntax
result <- tmp[, .(min = min(value), max = max(value), mean = mean(value), sd = sd(value)), 
    keyby = .(variable, decile)]

print(result, 10)
##      variable decile   min   max     mean         sd
##   1:       am      1 0.000 0.000 0.000000 0.00000000
##   2:       am      2 0.000 0.000 0.000000 0.00000000
##   3:       am      3 0.000 0.000 0.000000 0.00000000
##   4:       am      4 0.000 0.000 0.000000 0.00000000
##   5:       am      5 0.000 0.000 0.000000 0.00000000
##   6:       am      6 0.000 1.000 0.250000 0.50000000
##   7:       am      7 1.000 1.000 1.000000 0.00000000
##   8:       am      8 1.000 1.000 1.000000 0.00000000
##   9:       am      9 1.000 1.000 1.000000 0.00000000
##  10:       am     10 1.000 1.000 1.000000 0.00000000
##  ---                                                
## 101:       wt      1 1.513 1.935 1.724500 0.19428759
## 102:       wt      2 2.140 2.320 2.220000 0.09165151
## 103:       wt      3 2.465 2.770 2.618333 0.15250683
## 104:       wt      4 2.780 3.150 2.935000 0.19215879
## 105:       wt      5 3.170 3.215 3.191667 0.02254625
## 106:       wt      6 3.435 3.440 3.438750 0.00250000
## 107:       wt      7 3.460 3.570 3.516667 0.05507571
## 108:       wt      8 3.570 3.780 3.693333 0.10969655
## 109:       wt      9 3.840 4.070 3.918333 0.13137098
## 110:       wt     10 5.250 5.424 5.339667 0.08712252

Here is a piped version of the same code :

result <- 
    DT[,ID := 1:nrow(DT)] %>% 
    melt(id.vars = "ID") %>% 
    .[, decile := ifelse(grepl("_decile$", variable), "decile", "value")] %>% 
    .[, variable := gsub("_decile$", "", variable)] %>% 
    dcast(ID + variable ~ decile) %>% 
    .[, .(min = min(value), max = max(value), mean = mean(value), sd = sd(value)), 
     keyby = .(variable, decile)]

I appreciate your solution, I'll test it to see which of the above solutions is fast enough to work with my dataset of 75-110 million rows. Actually my original dataset has an id column, so I could eliminate that step. — ML_Passion, Sep 06 '17 at 15:15

CPak · Answer 2 · 2017-09-06T14:08:43.450

tidyverse solution

Using mtcars_deciled as the data. Replace mtcars with yourdata in the following solution to apply to your situation. This assumes _decile columns are a fixed width apart from the parent column.

library(tidyverse)
numcol <- ncol(mtcars)
ans <- map2(seq_len(numcol), names(mtcars), ~mtcars_deciled[,c(.x, .x+numcol)] %>%
                                          group_by_at(vars(dplyr::contains("decile"))) %>%
                                          summarise_at(vars(.y), funs(mean, sd, min, max)))

Note dplyr::contains is necessary to disambiguate it from purrr::contains

output

This will result in a list of data frames

[[1]]
# A tibble: 10 x 5
   mpg_decile mean_mpg min_mpg max_mpg     sd_mpg
        <int>    <dbl>   <dbl>   <dbl>      <dbl>
 1          1 12.10000    10.4    14.3 2.00499377
 2          2 14.96667    14.7    15.2 0.25166115
 3          3 15.50000    15.2    15.8 0.30000000
 4          4 17.16667    16.4    17.8 0.70945989
 5          5 18.66667    18.1    19.2 0.55075705
 6          6 20.22500    19.2    21.0 0.91787799
 7          7 21.43333    21.4    21.5 0.05773503
 8          8 23.33333    22.8    24.4 0.92376043
 9          9 27.90000    26.0    30.4 2.26053091
10         10 32.23333    30.4    33.9 1.75594229

[[2]]
# A tibble: 10 x 5
   cyl_decile mean_cyl min_cyl max_cyl   sd_cyl
        <int>    <dbl>   <dbl>   <dbl>    <dbl>
 1          1 4.000000       4       4 0.000000
 2          2 4.000000       4       4 0.000000
 3          3 4.000000       4       4 0.000000
 4          4 5.333333       4       6 1.154701
 5          5 6.000000       6       6 0.000000
 6          6 7.000000       6       8 1.154701
 7          7 8.000000       8       8 0.000000
 8          8 8.000000       8       8 0.000000
 9          9 8.000000       8       8 0.000000
10         10 8.000000       8       8 0.000000

# etc

Note that `summarise_` and the `_` family [is on it's way out](http://dplyr.tidyverse.org/reference/se-deprecated.html). — Axeman, Sep 06 '17 at 07:34
@CPak For this part of the code `group_by_at(vars(dplyr::contains("decile")))`, I get an error : `Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "quosures"` Still thanks for your solution. — ML_Passion, Sep 06 '17 at 14:01
@ML_Passion btw, I updated my code since the original post. Could you try with the newest version? (which includes `numcol <- ncol(mtcars)`) — CPak, Sep 06 '17 at 14:35

Summary statistics grouped by multiple columns on specific columns corresponding to them

2 Answers2

tidyverse solution

output