
I have recently migrated to R from Stata. I am unsure how to compute descriptive statistics on grouped and ungrouped observations.

Here's my data:

dput(DF)
structure(list(Product_Name = c("iPhone", "iPhone", "iPhone", 
"iPhone", "iPhone", "iPhone", "Nexus 6P", "Nexus 6P", "Nexus 6P", 
"Nexus 6P", "Nexus 6P", "Nexus 6P"), Product_Type = c("New", 
"New", "Refurbished", "New", "New", "Refurbished", "Refurbished", 
"Refurbished", "Refurbished", "Refurbished", "Refurbished", "Refurbished"
), Year = c(2006, 2011, 2009, 2008, 2011, 2009, 2012, 2007, 2013, 
2015, 2009, 2010), Units = c(100, 200, 300, 400, 500, 600, 700, 
200, 120, 125, 345, 340)), .Names = c("Product_Name", "Product_Type", 
"Year", "Units"), row.names = c(NA, 12L), class = "data.frame")

My data has products sold by year and type. Each product could be either a refurbished product or a new product. Further, if it was sold before 2010, I mark it as sold in "Time 1"; otherwise I mark it as sold in "Time 2".

Here's my code for this:

DF[DF$Year < 2010, "Time"] <- "1"
DF[DF$Year >= 2010, "Time"] <- "2"
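
For reference, the same Time variable can also be created with dplyr; this is just a minimal equivalent sketch, the base-R assignment above works as well:

library(dplyr)

# "1" for years before 2010, "2" otherwise
DF <- DF %>% mutate(Time = ifelse(Year < 2010, "1", "2"))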

Now, I want to generate descriptive statistics for these time periods.

DF %>% 
  group_by(Product_Name, Product_Type, Time) %>%
  dplyr::summarise(Count = n(), 
                   Sum_Units = sum(Units, na.rm = TRUE), 
                   Avg_Units = mean(Units, na.rm = TRUE), 
                   Max_Units = max(Units, na.rm = TRUE))

If we run the above code, we get descriptive statistics by Product_Name, Product_Type, and Time (i.e. grouped descriptive statistics). However, this is not quite what I want. I want descriptive statistics both with and without the Product_Type and Time groupings. That is, I also want to compute descriptive statistics assuming that products were sold in either Time 1 or Time 2 (i.e. all years), and irrespective of the type of product sold, while retaining some of the grouped information above.
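
For instance, ignoring both Product_Type and Time, the per-product statistics I am after would look something like the sketch below (just an illustration of one grouping level; the column names mirror my expected output further down):

# descriptive statistics by Product_Name only,
# pooling over Product_Type and Time (i.e. all years, both types)
DF %>%
  group_by(Product_Name) %>%
  dplyr::summarise(Sum_Units_Time_1_And_2 = sum(Units, na.rm = TRUE),
                   Avg_Units_Time_1_And_2 = mean(Units, na.rm = TRUE),
                   Max_Units_Time_1_And_2 = max(Units, na.rm = TRUE))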

Expected output:

dput(DFOut)
structure(list(Product_Name = c("iPhone", "Nexus 6P"), New_Units_Sum_Time1 = c(500, 
NA), Refurbished_Units_Sum_Time_1 = c(900, 545), Sum_Units_Time1 = c(1400, 
545), Sum_Units_Time2 = c(700, 1285), Sum_Units_Time_1_And_2 = c(2100, 
1830), Avg_Units_Time1 = c(350, 272.5), Avg_Units_Time2 = c(350, 
321.25), Avg_Units_Time_1_And_2 = c(350, 305), Max_Units_Time1 = c(600, 
345), Max_Units_Time2 = c(500, 700), Max_Units_Time_1_And_2 = c(600, 
700)), .Names = c("Product_Name", "New_Units_Sum_Time1", "Refurbished_Units_Sum_Time_1", 
"Sum_Units_Time1", "Sum_Units_Time2", "Sum_Units_Time_1_And_2", 
"Avg_Units_Time1", "Avg_Units_Time2", "Avg_Units_Time_1_And_2", 
"Max_Units_Time1", "Max_Units_Time2", "Max_Units_Time_1_And_2"
), row.names = 1:2, class = "data.frame")

In the output, you will see that I have several descriptive statistics:

a) based on the type of product and the period in which it was sold (e.g. New_Units_Sum_Time1, i.e. New and Time1). Please note that in the output I have only shown the combination of New and Time1. If you can guide me on how to produce descriptive statistics for the other combinations of Refurbished and Time, that would be awesome.

b) based on ignoring the type of product but not the period sold (e.g. Sum_Units_Time1, i.e. Time1)

c) based on ignoring both the type of product and the period it was sold (e.g. Sum_Units_Time_1_And_2).

Ditto for Avg and Max.

How can I do this? I'd appreciate any help; I am really struggling with this.


Please note that I manually created DFOut using Excel. Although I triple-checked it, there could be some manual errors; I will be more than happy to clarify if there are questions. Thank you for your time.


sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grDevices datasets  stats     graphics  grid      tcltk     utils     methods   base     

other attached packages:
 [1] tables_0.8              Hmisc_4.0-2             Formula_1.2-1           survival_2.40-1        
 [5] ResourceSelection_0.3-0 magrittr_1.5            stringr_1.1.0           bit64_0.9-5            
 [9] bit_1.1-12              tufterhandout_1.2.1     knitr_1.15.1            rmarkdown_1.3          
[13] tufte_0.2               corrplot_0.77           purrr_0.2.2             readr_1.0.0            
[17] tibble_1.2              tidyverse_1.1.1         cowplot_0.7.0           plotly_4.5.6           
[21] ggplot2_2.2.1           maps_3.1.1              directlabels_2015.12.16 tidyr_0.6.1            
[25] ggthemes_3.3.0          R2HTML_2.3.2            lubridate_1.6.0         xts_0.9-7              
[29] zoo_1.7-14              lattice_0.20-34         corrgram_1.10           hexbin_1.27.1          
[33] sm_2.2-5.4              compare_0.2-6           installr_0.18.0         psych_1.6.12           
[37] reshape2_1.4.2          readstata13_0.8.5       pastecs_1.3-18          boot_1.3-18            
[41] vcd_1.4-3               car_2.1-4               xlsxjars_0.6.1          rJava_0.9-8            
[45] debug_1.3.1             dplyr_0.5.0             foreign_0.8-67          gmodels_2.16.2         
[49] openxlsx_4.0.0          plyr_1.8.4             

loaded via a namespace (and not attached):
 [1] minqa_1.2.4         colorspace_1.3-2    class_7.3-14        modeltools_0.2-21   mclust_5.2.2       
 [6] rprojroot_1.2       htmlTable_1.9       base64enc_0.1-3     MatrixModels_0.4-1  flexmix_2.3-13     
[11] mvtnorm_1.0-5       xml2_1.1.1          codetools_0.2-15    splines_3.3.2       mnormt_1.5-5       
[16] robustbase_0.92-7   jsonlite_1.2        nloptr_1.0.4        pbkrtest_0.4-6      broom_0.4.1        
[21] cluster_2.0.5       kernlab_0.9-25      httr_1.2.1          backports_1.0.5     assertthat_0.1     
[26] Matrix_1.2-7.1      lazyeval_0.2.0      acepack_1.4.1       htmltools_0.3.5     quantreg_5.29      
[31] tools_3.3.2         gtable_0.2.0        Rcpp_0.12.9         trimcluster_0.1-2   gdata_2.17.0       
[36] nlme_3.1-128        iterators_1.0.8     fpc_2.1-10          lmtest_0.9-34       lme4_1.1-12        
[41] rvest_0.3.2         gtools_3.5.0        dendextend_1.4.0    DEoptimR_1.0-8      MASS_7.3-45        
[46] scales_0.4.1        TSP_1.1-4           hms_0.3             parallel_3.3.2      SparseM_1.74       
[51] RColorBrewer_1.1-2  gridExtra_2.2.1     rpart_4.1-10        latticeExtra_0.6-28 stringi_1.1.2      
[56] gclus_1.3.1         mvbutils_2.7.4.1    foreach_1.4.3       checkmate_1.8.2     seriation_1.2-1    
[61] caTools_1.17.1      prabclus_2.2-6      bitops_1.0-6        evaluate_0.10       htmlwidgets_0.8    
[66] R6_2.2.0            gplots_3.0.1        DBI_0.5-1           haven_1.0.0         whisker_0.3-2      
[71] mgcv_1.8-16         nnet_7.3-12         modelr_0.1.0        KernSmooth_2.23-15  viridis_0.3.4      
[76] readxl_0.1.1        data.table_1.10.0   forcats_0.2.0       digest_0.6.12       diptest_0.75-7     
[81] stats4_3.3.2        munsell_0.4.3       registry_0.3        viridisLite_0.1.3   quadprog_1.5-5    
  • So basically you need a different grouping variable each time...? A series of `group_by %>% ... %>% ungroup() ....%>%... group_by..` ? – Sotos Jan 30 '17 at 07:39
  • @Sotos - Thanks again for your help, as always. Yes, the only clarification I would add is that there are multiple layers: descriptive statistics by a) grouped based on product name, time and type b) grouped based on product name, time c) grouped based on product name. Does that help? – watchtower Jan 30 '17 at 07:43
  • Well, yes. That is exactly what I understood :) – Sotos Jan 30 '17 at 07:48

1 Answer


One way to automate this is to first create a vector (ind) with all pairwise and three-way combinations of your grouping variables. We then take those combinations and convert them to formulas with Units as the response. Each formula is saved in a list (l1), and we iterate over that list and aggregate.

# all pairwise and three-way combinations of the grouping variables
ind <- unlist(sapply(c(2, 3), function(i) 
  combn(c('Product_Name', 'Product_Type', 'Time'), i, paste, collapse = '+')))

# one formula per combination, with Units as the response
l1 <- sapply(ind, function(i) as.formula(paste('Units ~', i)))

# aggregate Units by each set of grouping variables (df is the DF from the question)
lapply(l1, function(i) aggregate(i, df, FUN = function(j) c(sum1 = sum(j), 
                                                            avg = mean(j), 
                                                            max_units = max(j))))

#which gives

#$`Product_Name+Product_Type`
#  Product_Name Product_Type Units.sum1 Units.avg Units.max_units
#1       iPhone          New       1200       300             500
#2       iPhone  Refurbished        900       450             600
#3     Nexus 6P  Refurbished       1830       305             700

#$`Product_Name+Time`
#  Product_Name Time Units.sum1 Units.avg Units.max_units
#1       iPhone    1    1400.00    350.00          600.00
#2     Nexus 6P    1     545.00    272.50          345.00
#3       iPhone    2     700.00    350.00          500.00
#4     Nexus 6P    2    1285.00    321.25          700.00

#$`Product_Type+Time`
#  Product_Type Time Units.sum1 Units.avg Units.max_units
#1          New    1     500.00    250.00          400.00
#2  Refurbished    1    1445.00    361.25          600.00
#3          New    2     700.00    350.00          500.00
#4  Refurbished    2    1285.00    321.25          700.00

#$`Product_Name+Product_Type+Time`
#  Product_Name Product_Type Time Units.sum1 Units.avg Units.max_units
#1       iPhone          New    1     500.00    250.00          400.00
#2       iPhone  Refurbished    1     900.00    450.00          600.00
#3     Nexus 6P  Refurbished    1     545.00    272.50          345.00
#4       iPhone          New    2     700.00    350.00          500.00
#5     Nexus 6P  Refurbished    2    1285.00    321.25          700.00
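
If you want all of this in a single wide data frame with one row per Product_Name, similar to the DFOut in the question (see also the comments below), one possible approach is to compute each grouping level separately with dplyr/tidyr and join on Product_Name. This is only a rough sketch following the question's column-naming convention (e.g. New_Units_Sum_Time1), not a polished solution:

library(dplyr)
library(tidyr)

# (1) per product, type and time: the Sum only, as in New_Units_Sum_Time1
by_type_time <- DF %>%
  group_by(Product_Name, Product_Type, Time) %>%
  summarise(Sum = sum(Units)) %>%
  ungroup() %>%
  unite(key, Product_Type, Time, sep = '_Units_Sum_Time') %>%
  spread(key, Sum)

# (2) per product and time: Sum, Avg, Max with Time spread into columns
by_time <- DF %>%
  group_by(Product_Name, Time) %>%
  summarise(Sum = sum(Units), Avg = mean(Units), Max = max(Units)) %>%
  ungroup() %>%
  gather(stat, value, Sum, Avg, Max) %>%
  unite(key, stat, Time, sep = '_Units_Time') %>%
  spread(key, value)

# (3) per product over all years (Time 1 and 2 combined)
overall <- DF %>%
  group_by(Product_Name) %>%
  summarise(Sum_Units_Time_1_And_2 = sum(Units),
            Avg_Units_Time_1_And_2 = mean(Units),
            Max_Units_Time_1_And_2 = max(Units))

# join the three levels into one row per product
by_type_time %>%
  left_join(by_time, by = 'Product_Name') %>%
  left_join(overall, by = 'Product_Name')
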
  • Thanks Sotos...I changed your `df` to `DF1`, and I get this error: `Error in eval(expr, envir, enclos) : object 'Units' not found` . Here's traceback:`10. eval(expr, envir, enclos) 9.eval(predvars, data, env) 8.model.frame.default(formula = i, data = DF1) 7.stats::model.frame(formula = i, data = DF1) 6.eval(expr, envir, enclos) 5.eval(m, parent.frame()) 4.aggregate.formula(i, DF1, FUN = function(j) c(sum1 = sum(j), avg = mean(j), max_units = max(j))) 3.aggregate(i, DF1, ..2.FUN(X[[i]], ...) 1. lapply(l1, function(i) aggregate(i, DF1, .. – watchtower Jan 30 '17 at 08:15
  • hmm...don't know. Weird. It works perfectly here. Is everything you are using up to date? – Sotos Jan 30 '17 at 08:18
  • I have added my packages. I hope something is not wrong with my R install. – watchtower Jan 30 '17 at 08:24
  • Ok. Got it. `DF1` should be `DF` in my code. Nothing wrong with your code. My bad. – watchtower Jan 30 '17 at 08:25
  • Final question: is there any way I can combine all the lists into one, similar to the output I have posted? As in, I have only two rows for each `Product Name`. Any idea? I am asking because I want to compare descriptive statistics from "Time 1", "Time 2" and "Time 1 + Time 2". I was thinking to do this comparison row-wise ,and for this, I need all the columns in one data.frame or list. – watchtower Jan 30 '17 at 08:27
  • If you look at [this post](http://stackoverflow.com/questions/7962267/cbind-a-df-with-an-empty-df-cbind-fill) you ll find some `cbind.fill` functions that you can use. i.e. `do.call(cbind.fill, lapply(l1, function(i) aggregate(i, df, FUN = function(j) c(sum1 = sum(j), avg = mean(j), max_units = max(j)))))` – Sotos Jan 30 '17 at 08:30
  • To be honest, I am not so comfortable yet with R because I am still a beginner. Is there any way you could help me out by producing the same output as the sample output posted above? I did run the command you have posted, but it seems all the columns have been bound together by `cbind.fill`. There are many duplicate columns. So, I'd appreciate your help. – watchtower Jan 30 '17 at 08:37
  • Well, I don't have time right now to be honest... but as soon as I can I will give it a look. Meanwhile, this could be a good exercise for you – Sotos Jan 30 '17 at 08:45
  • Sure, I will definitely try. However, it will take time because I am a beginner and not really familiar with the vectorization in your code. However, I won't give up. I will come back as soon as I have something. In the meantime, if you find a chance, I'd truly appreciate your thoughts. – watchtower Feb 02 '17 at 18:42
  • Quick question: I am trying to understand why you did `unlist` on top of `sapply` for one line and not so for the other. Reading from `sapply` vs. `vapply` vs. `lapply` article by joran, `sapply` should have returned a vector or matrix. Right? I am taking baby steps to understand your code. I'd appreciate your thoughts. – watchtower Feb 02 '17 at 19:05
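
(Regarding the question in the last comment: combn() returns three pairs for i = 2 but only one triple for i = 3, so the results have different lengths; sapply() cannot simplify them into a vector or matrix and falls back to returning a list, which is why unlist() is wrapped around it. A quick illustration:)

# without unlist(), sapply() returns a list because the pieces differ in length
sapply(c(2, 3), function(i)
  combn(c('Product_Name', 'Product_Type', 'Time'), i, paste, collapse = '+'))
# [[1]]
# [1] "Product_Name+Product_Type" "Product_Name+Time"         "Product_Type+Time"
#
# [[2]]
# [1] "Product_Name+Product_Type+Time"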