R equivalent of `table ,contents( )` Stata command for summary statistics

Question

I am trying to mimic the table Stata command in R, which performs summary statistics tables. The command allows you to create cross tables with diverse statistics inside of the resulting cells. For instance, in my example below, I am crossing three variables (category1, category2, and category3) and getting as a column vector the mean and standard deviation of metric1 and the mean and standard deviation metric2.

The stated behavior is obtained with the following single line on Stata.

table category1 category2 category3 ,c(mean metric1 sd metric1 mean metric2 sd metric2)

Desired output: Explanation of the table.

Here each column vector of the resulting cross table, lets say X of the cross table contains X = [mean(metric1),sd(metric1), mean(metric2),sd(metric2)]'

----------------------------------------------------------------------------
          |                     category3 and category2                     
          | ------------ First -----------    ----------- Second -----------
category1 |      A        B       C   Total         A       B       C   Total
----------+-----------------------------------------------------------------
        1 |  mean(metric1)  
          |  sd(metric1)  
          |  mean(metric2)  
          |  sd(metric1)

Desired output (!): Resulting Table on Stata.


----------------------------------------------------------------------------
          |                     category3 and category2                     
          | ------------ First -----------    ----------- Second -----------
category1 |      A       B       C   Total         A       B       C   Total
----------+-----------------------------------------------------------------
        1 |  5.778   7.200   2.571   5.048     6.667   3.000   3.000   4.222
          |  2.906   3.347   2.507   3.324     2.309   1.414   1.155   2.333
          | -1.556  -2.000  -1.143  -1.524    -2.000  -2.000  -3.000  -2.444
          |  1.667   0.000   1.069   1.250     0.000   2.828   1.155   1.333
          | 
        2 |  3.200   6.333   4.200   4.571     4.889   5.000   5.000   4.947
          |  2.280   3.445   2.741   2.976     3.180   3.464   2.449   2.857
          | -0.800  -2.000  -2.000  -1.714    -2.222  -1.500  -1.000  -1.684
          |  1.095   1.265   1.333   1.309     1.563   1.000   1.673   1.529
          | 
        3 |  8.667   4.667   5.167   5.667     5.667   6.667   6.000   6.000
          |  2.309   2.309   2.758   2.849     3.445   4.163   3.464   3.303
          | -3.333  -2.667  -2.000  -2.333    -2.333  -2.000  -1.333  -2.000
          |  1.155   1.155   1.477   1.414     0.816   2.000   1.155   1.206
          | 
    Total |  5.529   6.286   4.207   5.067     5.444   5.111   4.615   5.100
          |  3.125   3.124   2.795   3.047     3.053   3.333   2.501   2.898
          | -1.647  -2.143  -1.793  -1.833    -2.222  -1.778  -1.692  -1.950
          |  1.618   0.949   1.346   1.342     1.166   1.563   1.601   1.395
----------------------------------------------------------------------------

Stata code that generates the above result.

clear all
set obs 100

set seed 777
gen category1 = runiformint(1,3)
gen category2_num = runiformint(1,3)

gen category2 = "A" if category2_num ==1
replace category2 = "B" if category2_num ==2
replace category2 = "C" if category2_num ==3

drop category2_num

gen category3_num = runiformint(1,2)
gen category3 = "First" if category3_num ==1
replace category3 = "Second" if category3_num ==2

drop category3_num

gen metric1 = round(runiform()*10,2)
gen metric2 = round(runiform()*-4,2)

table category1 category2 category3 /// List of the variables that will create the crosstab
         ,c(mean metric1 sd metric1 /// Mean and std.dev of metric1 as 1st and 2nd rows
         mean metric2 sd metric2)   /// Mean and std.dev of metric2 as 3rd and 4th rows
         row col                    /// Add the over all statistics total rows and cols
         format(%9.3f)              // Decimal style setting.

R attempt.

Here is how I have tackled the problem. However, I am still far from my desired results. Even though I have the same information displayed on the screen, the readability is very poor in the way I am presenting it on R. Additionally, I haven't computed the mean and standard deviation for the total of rows and columns and I did on the Stata output.

Finally, in my opinion, this procedure seems like an overkill solution for such a simple problem. In my context packages are allowed, hence, dplyr or data.table suggestions are welcome.

Include Stata generated data + replication routine.

df <- as.data.frame(structure(list(category1 = structure(c(1, 3, 1, 2, 3, 1, 3, 1,1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 1,3, 3, 3, 2, 2, 1, 1, 1, 3, 2, 3, 1, 2, 2, 1, 3, 1, 3, 1, 1, 3,1, 1, 2, 1, 3, 2, 2, 3, 3, 3, 1, 2, 3, 2, 3, 2, 1, 1, 1, 2, 2,2, 1, 3, 2, 2, 2, 3, 3), format.stata = "%9.0g"), 
                    category2 = structure(c("C", "A", "A", "A", "C", "C", "A", "A", "A", "A", "B", "A", "A", "A","A", "B", "A", "C", "C", "B", "C", "A", "A", "C", "A", "B", "C", "B", "C", "C", "A", "C", "B", "B", "A", "B", "C", "A", "B", "B","C", "A", "A", "C", "C", "B", "C", "A", "A", "C", "C", "B", "C", "C", "A", "C", "A", "A", "C", "B", "A", "C", "C", "C", "B", "B","C", "C", "A", "A", "C", "C", "A", "C", "B", "B", "C", "C", "C", "C", "A", "C", "C", "C", "C", "B", "B", "B", "B", "C", "A", "A","C", "C", "A", "A", "A", "B", "B", "C"), format.stata = "%9s"), 
                    category3 = structure(c("First", "Second", "First", "First", "First", "First", "Second", "Second", "First", "Second", "First", "First", "Second", "Second", "First", "Second", "Second", "First", "Second", "First", "First", "First", "First","Second", "First", "First", "Second", "First", "First", "First","First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First","Second", "Second", "First", "Second", "Second", "Second","First", "First", "First", "Second", "Second", "First", "First","Second", "First", "First", "First", "First", "Second", "First","Second", "Second", "First", "Second", "First", "Second", "First", "Second", "First", "First", "First", "First", "Second","First", "First", "First", "Second", "Second", "First", "First","First", "Second", "First", "Second", "First", "Second","Second", "First", "Second", "First", "First", "Second","Second", "Second", "Second", "First"), format.stata = "%9s"),
                    metric1 = structure(c(0, 10, 0, 0, 8, 4, 4, 8, 8, 2, 4, 4, 6, 2, 6, 8, 6, 4, 4, 10, 10, 4, 6, 8, 6, 2, 4, 4, 6, 0, 6,0, 10, 8, 2, 2, 2, 0, 2, 10, 2, 8, 4, 6, 8, 2, 2, 6, 0, 2,4, 6, 2, 2, 8, 6, 8, 8, 2, 8, 10, 4, 4, 4, 4, 10, 4, 2, 6,4, 6, 4, 10, 2, 8, 6, 8, 2, 6, 6, 6, 4, 8, 6, 8, 2, 10, 2, 6, 2, 10, 4, 8, 0, 10, 6, 4, 2, 8, 8), format.stata = "%9.0g"),
                    metric2 = structure(c(0, -4, 0, 0, -2, -2, -2, -2, -4, -2, -2, -2, -2, -4, 0, 0, -2, -2, -4, -2, 0, -2, -4, -2, -2, -2, -2, -2, -4, 0, -4, -4, -2, -2, -2, -2, -2, -2, -4, -2, -2, -2, -2, -2, 0, -2, -4, -4, -2, -2, 0, -4, -2, 0, -2,-2, 0, -2, -4, 0, -2, -2, 0, 0, -4, -4, 0, -2, 0, -2, -2, -4, 0, -2, -2, -2, 0, -2, -2, -2, -2, -2, -2, 0, 0, 0, -2, 0, -2, -4, 0, 0, 0, -2, -4, -4, 0, -2, -2, -4), format.stata = "%9.0g")), 
               row.names = c(NA,-100L), class = c("tbl_df", "tbl", "data.frame")))

# expand grid for every possible value
prs <- expand.grid(cat1 = unique(df$category1)   ,
                   cat2 = unique(df$category2) ,
                   cat3 = unique(df$category3))

#Number of total combinations 
N <-   nrow(prs)
#Loop over the combinations to get the desired statistis
A <- lapply(1:N, FUN = function(i){
      mean1 <- mean(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
      sd1   <- sd(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
        
      mean2 <- mean(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
      sd2   <- sd(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
        
      r_list<- list(cat1 = prs$cat1[i],cat2 = prs$cat2[i], cat3 = prs$cat3[i],
                    mean1 = mean1,  sd1 = sd1 , mean2 = mean2, sd2 = sd2)
  return(r_list)
})

#List to data.frame
df_stats <- do.call(rbind.data.frame, A)

Obtained output (but, not my desired output (!) )

# cat1 cat2   cat3    mean1      sd1     mean2       sd2
# 2     1    C  First 2.571429 2.507133 -1.142857 1.0690450
# 21    3    C  First 5.166667 2.757909 -2.000000 1.4770979
# 3     2    C  First 4.200000 2.740641 -2.000000 1.3333333
# 4     1    A  First 5.777778 2.905933 -1.555556 1.6666667
# 5     3    A  First 8.666667 2.309401 -3.333333 1.1547005
# 6     2    A  First 3.200000 2.280351 -0.800000 1.0954451
# 7     1    B  First 7.200000 3.346640 -2.000000 0.0000000
# 8     3    B  First 4.666667 2.309401 -2.666667 1.1547005
# 9     2    B  First 6.333333 3.444803 -2.000000 1.2649111
# 10    1    C Second 3.000000 1.154701 -3.000000 1.1547005
# 11    3    C Second 6.000000 3.464102 -1.333333 1.1547005
# 12    2    C Second 5.000000 2.449490 -1.000000 1.6733201
# 13    1    A Second 6.666667 2.309401 -2.000000 0.0000000
# 14    3    A Second 5.666667 3.444803 -2.333333 0.8164966
# 15    2    A Second 4.888889 3.179797 -2.222222 1.5634719
# 16    1    B Second 3.000000 1.414214 -2.000000 2.8284271
# 17    3    B Second 6.666667 4.163332 -2.000000 2.0000000
# 18    2    B Second 5.000000 3.464102 -1.500000 1.0000000

B. Christian Kamgang · Accepted Answer · 2020-12-20T11:49:49.543

You could use data.table and magrittr packages as follows:

library(magrittr)
library(data.table)

# function to compute the mean and sd
fun <- function(x, y) list(metric1_meam=mean(x), metric1_sd=sd(x), metric2_meam=mean(y), metric2_sd=sd(y))

# compute the Total column, and A,B,C columns of the desired output as follows and bind them 

setDT(df)[, 'category1' := as.character(category1)]

Y <- rbind(
  df[, fun(metric1, metric2), by=.(category1, category2, category3)],
  df[, fun(metric1, metric2), by=.(category1, category3)][, category2 := 'Total'],
  df[, fun(metric1, metric2), by=.(category2, category3)][, category1 := 'Total'],
  df[, fun(metric1, metric2), by=.(category3)][, c('category1', 'category2') := 'Total']
)

# generate the desired output
melt(Y, measure=patterns('metric')) %>% 
  xtabs(formula = value ~ .) %>% 
  ftable(col.vars = c('category3', 'category2'))





                       category3      First                                      Second                                 
                       category2          A          B          C      Total          A          B          C      Total
category1 variable                                                                                                      
1         metric1_meam            5.7777778  7.2000000  2.5714286  5.0476190  6.6666667  3.0000000  3.0000000  4.2222222
          metric1_sd              2.9059326  3.3466401  2.5071327  3.3237959  2.3094011  1.4142136  1.1547005  2.3333333
          metric2_meam           -1.5555556 -2.0000000 -1.1428571 -1.5238095 -2.0000000 -2.0000000 -3.0000000 -2.4444444
          metric2_sd              1.6666667  0.0000000  1.0690450  1.2497619  0.0000000  2.8284271  1.1547005  1.3333333
2         metric1_meam            3.2000000  6.3333333  4.2000000  4.5714286  4.8888889  5.0000000  5.0000000  4.9473684
          metric1_sd              2.2803509  3.4448028  2.7406406  2.9760952  3.1797973  3.4641016  2.4494897  2.8572264
          metric2_meam           -0.8000000 -2.0000000 -2.0000000 -1.7142857 -2.2222222 -1.5000000 -1.0000000 -1.6842105
          metric2_sd              1.0954451  1.2649111  1.3333333  1.3093073  1.5634719  1.0000000  1.6733201  1.5294382
3         metric1_meam            8.6666667  4.6666667  5.1666667  5.6666667  5.6666667  6.6666667  6.0000000  6.0000000
          metric1_sd              2.3094011  2.3094011  2.7579087  2.8491485  3.4448028  4.1633320  3.4641016  3.3028913
          metric2_meam           -3.3333333 -2.6666667 -2.0000000 -2.3333333 -2.3333333 -2.0000000 -1.3333333 -2.0000000
          metric2_sd              1.1547005  1.1547005  1.4770979  1.4142136  0.8164966  2.0000000  1.1547005  1.2060454
Total     metric1_meam            5.5294118  6.2857143  4.2068966  5.0666667  5.4444444  5.1111111  4.6153846  5.1000000
          metric1_sd              3.1248529  3.1238185  2.7951400  3.0469027  3.0529103  3.3333333  2.5012817  2.8982753
          metric2_meam           -1.6470588 -2.1428571 -1.7931034 -1.8333333 -2.2222222 -1.7777778 -1.6923077 -1.9500000
          metric2_sd              1.6179144  0.9492623  1.3464055  1.3424827  1.1659662  1.5634719  1.6012815  1.3950462

jay.sf · Answer 2 · 2020-12-18T15:19:03.020

You may exploit the power of aggregate.

FUN <- function(x) c(mean=mean(x), sd=sd(x))
aggregate(cbind(metric1, metric2) ~ ., df, FUN)
#    category1 category2 category3 metric1.mean metric1.sd metric2.mean metric2.sd
# 1          1         A     First     5.777778   2.905933   -1.5555556  1.6666667
# 2          2         A     First     3.200000   2.280351   -0.8000000  1.0954451
# 3          3         A     First     8.666667   2.309401   -3.3333333  1.1547005
# 4          1         B     First     7.200000   3.346640   -2.0000000  0.0000000
# 5          2         B     First     6.333333   3.444803   -2.0000000  1.2649111
# 6          3         B     First     4.666667   2.309401   -2.6666667  1.1547005
# 7          1         C     First     2.571429   2.507133   -1.1428571  1.0690450
# 8          2         C     First     4.200000   2.740641   -2.0000000  1.3333333
# 9          3         C     First     5.166667   2.757909   -2.0000000  1.4770979
# 10         1         A    Second     6.666667   2.309401   -2.0000000  0.0000000
# 11         2         A    Second     4.888889   3.179797   -2.2222222  1.5634719
# 12         3         A    Second     5.666667   3.444803   -2.3333333  0.8164966
# 13         1         B    Second     3.000000   1.414214   -2.0000000  2.8284271
# 14         2         B    Second     5.000000   3.464102   -1.5000000  1.0000000
# 15         3         B    Second     6.666667   4.163332   -2.0000000  2.0000000
# 16         1         C    Second     3.000000   1.154701   -3.0000000  1.1547005
# 17         2         C    Second     5.000000   2.449490   -1.0000000  1.6733201
# 18         3         C    Second     6.000000   3.464102   -1.3333333  1.1547005

For cross tabulation try xtabs.

aggregate yields matrices in columns when multiple functions are applied(see this answer, why), so first we get rid of them.

r <- do.call(data.frame, aggregate(cbind(metric1, metric2) ~ ., df, FUN))

Now we can apply xtabs, e.g. For each of the category3.

xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == "First", 1:5])
# , , category3 = First,  = metric1.mean
# 
#          category2
# category1        A        B        C
#         1 5.777778 7.200000 2.571429
#         2 3.200000 6.333333 4.200000
#         3 8.666667 4.666667 5.166667
# 
# , , category3 = First,  = metric1.sd
# 
#           category2
# category1        A        B        C
#         1 2.905933 3.346640 2.507133
#         2 2.280351 3.444803 2.740641
#         3 2.309401 2.309401 2.757909

xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == "Second", 1:5])
# , , category3 = Second,  = metric1.mean
# 
#          category2
# category1        A        B        C
#         1 6.666667 3.000000 3.000000
#         2 4.888889 5.000000 5.000000
#         3 5.666667 6.666667 6.000000
# 
# , , category3 = Second,  = metric1.sd
# 
#          category2
# category1        A        B        C
#         1 2.309401 1.414214 1.154701
#         2 3.179797 3.464102 2.449490
#         3 3.444803 4.163332 3.464102

Or use sapply to do this in one step.

sapply(c("First", "Second"), function(c3) 
  xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == c3, 1:5]),
  simplify="array")
# , , category3 = First,  = metric1.mean,  = First
# 
#          category2
# category1        A        B        C
#         1 5.777778 7.200000 2.571429
#         2 3.200000 6.333333 4.200000
#         3 8.666667 4.666667 5.166667
# 
# , , category3 = First,  = metric1.sd,  = First
# 
#          category2
# category1        A        B        C
#         1 2.905933 3.346640 2.507133
#         2 2.280351 3.444803 2.740641
#         3 2.309401 2.309401 2.757909
# 
# , , category3 = First,  = metric1.mean,  = Second
# 
#          category2
# category1        A        B C
#         1 6.666667 3.000000 3
#         2 4.888889 5.000000 5
#         3 5.666667 6.666667 6
# 
# , , category3 = First,  = metric1.sd,  = Second
# 
#          category2
# category1        A        B        C
#         1 2.309401 1.414214 1.154701
#         2 3.179797 3.464102 2.449490
#         3 3.444803 4.163332 3.464102

Wow, this is an amazing piece of code @jay.sf ; thanks for it. I see that you arrived at the same solutions that my spaghetti R code generated. Do you have any suggestions for my desired output with the cross-table format? — Álvaro A. Gutiérrez-Vargas, Dec 18 '20 at 14:54
@ÁlvaroA.GutiérrezVargas Without "overkill" maybe using `xtabs`? See updated answer. — jay.sf, Dec 18 '20 at 15:20
thank you for your suggestion of using `xtabs`; I haven't heard of it. As you can guess, I am fairly new to R. I think your suggestion is a huge improvement towards the solution. However, I still can figure out how to shuffle the order of the computed statistics because now I have two blocks of `means` and `sd` on top of each other, but not exactly as the desired output, which puts `mean1` `sd1` `mean2` `sd2` in this order as a column vector on each of the cells in the cross-table. Thx! — Álvaro A. Gutiérrez-Vargas, Dec 18 '20 at 15:50