I am trying to mimic the table
Stata command in R, which performs summary statistics tables. The command allows you to create cross tables with diverse statistics inside of the resulting cells. For instance, in my example below, I am crossing three variables (category1
, category2
, and category3
) and getting as a column vector the mean and standard deviation of metric1
and the mean and standard deviation metric2
.
The stated behavior is obtained with the following single line on Stata.
table category1 category2 category3 ,c(mean metric1 sd metric1 mean metric2 sd metric2)
Desired output: Explanation of the table.
Here each column vector of the resulting cross table, lets say X
of the cross table contains X = [mean(metric1),sd(metric1), mean(metric2),sd(metric2)]'
----------------------------------------------------------------------------
| category3 and category2
| ------------ First ----------- ----------- Second -----------
category1 | A B C Total A B C Total
----------+-----------------------------------------------------------------
1 | mean(metric1)
| sd(metric1)
| mean(metric2)
| sd(metric1)
Desired output (!): Resulting Table on Stata.
----------------------------------------------------------------------------
| category3 and category2
| ------------ First ----------- ----------- Second -----------
category1 | A B C Total A B C Total
----------+-----------------------------------------------------------------
1 | 5.778 7.200 2.571 5.048 6.667 3.000 3.000 4.222
| 2.906 3.347 2.507 3.324 2.309 1.414 1.155 2.333
| -1.556 -2.000 -1.143 -1.524 -2.000 -2.000 -3.000 -2.444
| 1.667 0.000 1.069 1.250 0.000 2.828 1.155 1.333
|
2 | 3.200 6.333 4.200 4.571 4.889 5.000 5.000 4.947
| 2.280 3.445 2.741 2.976 3.180 3.464 2.449 2.857
| -0.800 -2.000 -2.000 -1.714 -2.222 -1.500 -1.000 -1.684
| 1.095 1.265 1.333 1.309 1.563 1.000 1.673 1.529
|
3 | 8.667 4.667 5.167 5.667 5.667 6.667 6.000 6.000
| 2.309 2.309 2.758 2.849 3.445 4.163 3.464 3.303
| -3.333 -2.667 -2.000 -2.333 -2.333 -2.000 -1.333 -2.000
| 1.155 1.155 1.477 1.414 0.816 2.000 1.155 1.206
|
Total | 5.529 6.286 4.207 5.067 5.444 5.111 4.615 5.100
| 3.125 3.124 2.795 3.047 3.053 3.333 2.501 2.898
| -1.647 -2.143 -1.793 -1.833 -2.222 -1.778 -1.692 -1.950
| 1.618 0.949 1.346 1.342 1.166 1.563 1.601 1.395
----------------------------------------------------------------------------
Stata code that generates the above result.
clear all
set obs 100
set seed 777
gen category1 = runiformint(1,3)
gen category2_num = runiformint(1,3)
gen category2 = "A" if category2_num ==1
replace category2 = "B" if category2_num ==2
replace category2 = "C" if category2_num ==3
drop category2_num
gen category3_num = runiformint(1,2)
gen category3 = "First" if category3_num ==1
replace category3 = "Second" if category3_num ==2
drop category3_num
gen metric1 = round(runiform()*10,2)
gen metric2 = round(runiform()*-4,2)
table category1 category2 category3 /// List of the variables that will create the crosstab
,c(mean metric1 sd metric1 /// Mean and std.dev of metric1 as 1st and 2nd rows
mean metric2 sd metric2) /// Mean and std.dev of metric2 as 3rd and 4th rows
row col /// Add the over all statistics total rows and cols
format(%9.3f) // Decimal style setting.
R attempt.
Here is how I have tackled the problem. However, I am still far from my desired results. Even though I have the same information displayed on the screen, the readability is very poor in the way I am presenting it on R. Additionally, I haven't computed the mean and standard deviation for the total of rows and columns and I did on the Stata output.
Finally, in my opinion, this procedure seems like an overkill solution for such a simple problem. In my context packages are allowed, hence, dplyr
or data.table
suggestions are welcome.
Include Stata generated data + replication routine.
df <- as.data.frame(structure(list(category1 = structure(c(1, 3, 1, 2, 3, 1, 3, 1,1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 1,3, 3, 3, 2, 2, 1, 1, 1, 3, 2, 3, 1, 2, 2, 1, 3, 1, 3, 1, 1, 3,1, 1, 2, 1, 3, 2, 2, 3, 3, 3, 1, 2, 3, 2, 3, 2, 1, 1, 1, 2, 2,2, 1, 3, 2, 2, 2, 3, 3), format.stata = "%9.0g"),
category2 = structure(c("C", "A", "A", "A", "C", "C", "A", "A", "A", "A", "B", "A", "A", "A","A", "B", "A", "C", "C", "B", "C", "A", "A", "C", "A", "B", "C", "B", "C", "C", "A", "C", "B", "B", "A", "B", "C", "A", "B", "B","C", "A", "A", "C", "C", "B", "C", "A", "A", "C", "C", "B", "C", "C", "A", "C", "A", "A", "C", "B", "A", "C", "C", "C", "B", "B","C", "C", "A", "A", "C", "C", "A", "C", "B", "B", "C", "C", "C", "C", "A", "C", "C", "C", "C", "B", "B", "B", "B", "C", "A", "A","C", "C", "A", "A", "A", "B", "B", "C"), format.stata = "%9s"),
category3 = structure(c("First", "Second", "First", "First", "First", "First", "Second", "Second", "First", "Second", "First", "First", "Second", "Second", "First", "Second", "Second", "First", "Second", "First", "First", "First", "First","Second", "First", "First", "Second", "First", "First", "First","First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First","Second", "Second", "First", "Second", "Second", "Second","First", "First", "First", "Second", "Second", "First", "First","Second", "First", "First", "First", "First", "Second", "First","Second", "Second", "First", "Second", "First", "Second", "First", "Second", "First", "First", "First", "First", "Second","First", "First", "First", "Second", "Second", "First", "First","First", "Second", "First", "Second", "First", "Second","Second", "First", "Second", "First", "First", "Second","Second", "Second", "Second", "First"), format.stata = "%9s"),
metric1 = structure(c(0, 10, 0, 0, 8, 4, 4, 8, 8, 2, 4, 4, 6, 2, 6, 8, 6, 4, 4, 10, 10, 4, 6, 8, 6, 2, 4, 4, 6, 0, 6,0, 10, 8, 2, 2, 2, 0, 2, 10, 2, 8, 4, 6, 8, 2, 2, 6, 0, 2,4, 6, 2, 2, 8, 6, 8, 8, 2, 8, 10, 4, 4, 4, 4, 10, 4, 2, 6,4, 6, 4, 10, 2, 8, 6, 8, 2, 6, 6, 6, 4, 8, 6, 8, 2, 10, 2, 6, 2, 10, 4, 8, 0, 10, 6, 4, 2, 8, 8), format.stata = "%9.0g"),
metric2 = structure(c(0, -4, 0, 0, -2, -2, -2, -2, -4, -2, -2, -2, -2, -4, 0, 0, -2, -2, -4, -2, 0, -2, -4, -2, -2, -2, -2, -2, -4, 0, -4, -4, -2, -2, -2, -2, -2, -2, -4, -2, -2, -2, -2, -2, 0, -2, -4, -4, -2, -2, 0, -4, -2, 0, -2,-2, 0, -2, -4, 0, -2, -2, 0, 0, -4, -4, 0, -2, 0, -2, -2, -4, 0, -2, -2, -2, 0, -2, -2, -2, -2, -2, -2, 0, 0, 0, -2, 0, -2, -4, 0, 0, 0, -2, -4, -4, 0, -2, -2, -4), format.stata = "%9.0g")),
row.names = c(NA,-100L), class = c("tbl_df", "tbl", "data.frame")))
# expand grid for every possible value
prs <- expand.grid(cat1 = unique(df$category1) ,
cat2 = unique(df$category2) ,
cat3 = unique(df$category3))
#Number of total combinations
N <- nrow(prs)
#Loop over the combinations to get the desired statistis
A <- lapply(1:N, FUN = function(i){
mean1 <- mean(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
sd1 <- sd(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
mean2 <- mean(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
sd2 <- sd(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
r_list<- list(cat1 = prs$cat1[i],cat2 = prs$cat2[i], cat3 = prs$cat3[i],
mean1 = mean1, sd1 = sd1 , mean2 = mean2, sd2 = sd2)
return(r_list)
})
#List to data.frame
df_stats <- do.call(rbind.data.frame, A)
Obtained output (but, not my desired output (!) )
# cat1 cat2 cat3 mean1 sd1 mean2 sd2
# 2 1 C First 2.571429 2.507133 -1.142857 1.0690450
# 21 3 C First 5.166667 2.757909 -2.000000 1.4770979
# 3 2 C First 4.200000 2.740641 -2.000000 1.3333333
# 4 1 A First 5.777778 2.905933 -1.555556 1.6666667
# 5 3 A First 8.666667 2.309401 -3.333333 1.1547005
# 6 2 A First 3.200000 2.280351 -0.800000 1.0954451
# 7 1 B First 7.200000 3.346640 -2.000000 0.0000000
# 8 3 B First 4.666667 2.309401 -2.666667 1.1547005
# 9 2 B First 6.333333 3.444803 -2.000000 1.2649111
# 10 1 C Second 3.000000 1.154701 -3.000000 1.1547005
# 11 3 C Second 6.000000 3.464102 -1.333333 1.1547005
# 12 2 C Second 5.000000 2.449490 -1.000000 1.6733201
# 13 1 A Second 6.666667 2.309401 -2.000000 0.0000000
# 14 3 A Second 5.666667 3.444803 -2.333333 0.8164966
# 15 2 A Second 4.888889 3.179797 -2.222222 1.5634719
# 16 1 B Second 3.000000 1.414214 -2.000000 2.8284271
# 17 3 B Second 6.666667 4.163332 -2.000000 2.0000000
# 18 2 B Second 5.000000 3.464102 -1.500000 1.0000000