I have a data.frame of the size 75 million x 36, { 75 million rows) , where the cols are
col1, col1_decile, col2, col2_decile ........... col18 , col18_decile
Now I want to get summary statistics ( min, max, mean and standard deviation) corresponding for each of the columns col1,col2 ....... col18 grouped by their deciles.
i.e. summary statistics of
col1 by col1_decile, of col2 by col2_decile , of col3 by col3_decile ......, of col18 by col18_decile
For a reproducible example , I will do with the mtcars dataset:
library(dplyr)
data("mtcars")
mtcars %>% mutate_all(funs(decile = ntile(., 10))) -> mtcars_deciled
head(mtcars_deciled)
Here the columns are
mpg,cyl, disp, hp, drat,wt,qsec, vs, am, gear, carb,mpg_decile, cyl_decile, disp_decile, hp_decile, drat_decile,wt_decile qsec_decile, vs_decile, am_decile, gear_decile,carb_decile
I want the final data.frame to look like
decile mpg_decile_min mpg_decile_max mpg_decile_mean mpg_decile_sd ...
and so on for all the columns.
Each min, max, mean , std. deviation will be calculated based on the corresponding decile column
since it is a huge dataset of 75 million rows, I'm looking for fast solutions. I have tinkered with seplyr
in R! , but didn't get far.
Fast solutions with data.table
or dplyr
or seplyr
would be appreciated. The final data.frame should have 10 rows and 73 columns ( 4 summary statistic columns for min,max , mean and sd for each deciled columns ( 18 decile columns) and the common decile group column
decile mpg_decile_min mpg_decile_max mpg_decile_mean mpg_decile_sd .... carb_decile_min carb_decile_max carb_decile_mean carb_decile_d