2

Using R, I want to create a table with nested columns (and possibly nested rows). The cell values should include frequencies and within-sub-group totals and percentages (i.e., cell/[sub-group row total]*100).

I would prefer that the solution uses the tables package. I will be outputting to LaTeX.

Three questions:

1) Please explain why I'm getting the following error: Error in Percent("row") : Summary fn not allowed with Percent

library(tables)
set.seed(123)
df <- data.frame(exposure = sample(LETTERS[1:5], 100, TRUE),
             Group = sample(c("GroupX","GroupY"), 100, TRUE),
             disease = as.integer(sample(c(0,1), 100, TRUE)))

num <- function(x) base::sum(x, na.rm=TRUE)
tabular(Factor(exposure)+1~
          Factor(Group)*
          (Heading()*num*Heading(One)*disease*
             ((Total=1)+Percent("row"))), 
        data=df)

2) How can I create the following ideal table WITH additional columns for within-group percentages after each group*disease frequency. Note that persons without disease are not included in the table.

          Group                        
          GroupX         GroupY        
                 num            num    
 exposure Total  disease Total  disease
 A         9      4      13      6     
 B        12      4       9      5     
 C         9      8       9      6     
 D         7      1       8      3     
 E         9      4      15     12     
 All      46     21      54     32

Here is a start:

tabular(Factor(exposure) + 1 ~ 
          Factor(Group) * 
            ((Total = 1) + num *  disease), data = df)

3) The package uses Percent(). Why would one use a logical vector with Percent(). Can you give an example? Would using a logical vector help me with this problem?

This is similar to this question; however, the offered answer calculates incorrect percentages as evidenced by an example with more than 2 columns.

Community
  • 1
  • 1
penguinv22
  • 349
  • 5
  • 12
  • @SlowLearner I removed all of the other troubleshooting I had done. I struggle with wanting to document everything I've tried vs. succinctly asking for help. I kept `data.frame()` to produce my MRE rather than using `dput()` because code for dput() is much longer to achieve same result, and in this case, I've found that more observations in the dataset are helpful in tracing what percentages are being calculated. See my comment about a previous [SO](http://stackoverflow.com/questions/19319825/in-the-tables-package-how-to-get-column-percentages-of-a-subset-of-a-variable) question. – penguinv22 Jan 04 '14 at 15:27
  • @SlowLearner `Heading()` removes the label for a column or row. `Heading(value)` applies the label `value`. In some test datasets, especially where I tried to factor the variable `disease` it became visually confusing because the label for the column was `1`. I have further simplified that code. – penguinv22 Jan 04 '14 at 16:39
  • So you're in effect summing the value in the disease column? – SlowLearner Jan 04 '14 at 16:52
  • Yes. I'm treating what is in essence a binary variable as integer so I can sum all the `1` values as a count so that I can only a count of the "positive" results. The alternative is to treat it as a `tables::Factor()`, as I had shown in a previous version of this question, but then you get a frequency column for the `0`'s. – penguinv22 Jan 04 '14 at 16:59

2 Answers2

2

As of version 0.7.72, the tables package can calculate sub-group percentages. Credit for the commits and this answer goes to Duncan Murdoch, the package maintainer.

The updated source package is available by SVN from rForge. General installation instructions for installing source packages can be found here. Refer to answer 2 below for usage. Binary packages might be available by the time you read this.

1) The tables package will only calculate one thing per column. Percent is effectively a "summary function" and by defining a new summary function, num, I've asked it to calculate num in all of the columns, and also to calculate Percent in some of them. If I use my num function, I need to move it within the parentheses so it doesn't end up "multiplied" (in the tables grammar sense) by Percent. The following code will generate a count of persons with disease (i.e., disease==1) and it will generate a row percentage (Group Total/Row Total * 100), not the desired cell/(sub-group row total). With versions of tables < 0.7.72, that's as far as we can get.

library(tables) ## prior to 0.7.72

df <- data.frame(exposure = sample(LETTERS[1:5], 100, TRUE),
                 Group = sample(c("GroupX","GroupY"), 100, TRUE),
                 disease = as.integer(sample(c(0,1), 100, TRUE)))

num <- function(x) base::sum(x, na.rm=TRUE)
tabular(Factor(exposure)+1~
          Factor(Group)*
          (Heading("Group Total")*(1)+num*disease+Percent("row")),
        data=df)

2) Version 0.7.72 of the tables package will calculate the desired sub-group percentages. It introduces a pseudo-function called Equal().

set.seed(100)
library(tables)
df <- data.frame(exposure = sample(LETTERS[1:5], 100, TRUE),
                 Group = sample(c("GroupX","GroupY"), 100, TRUE),
                 disease = as.integer(sample(c(0,1), 100, TRUE)))

myTable <- tabular(Factor(exposure)+1~
                     Factor(Group)*
                     (Heading("Group Total")*(1)+Factor(disease)*((n=1)+Heading("%")*Percent(Equal(exposure,Group)))),
                   data=df)

myTable

myTable generates the following output:

          Group                                                                
          GroupX                             GroupY                            
                      disease                            disease               
                      0             1                    0             1       
 exposure Group Total n       %     n  %     Group Total n       %     n  %    
 A         5           1      20.00  4 80.00  6           3      50.00  3 50.00
 B        17          12      70.59  5 29.41 10           3      30.00  7 70.00
 C        13           4      30.77  9 69.23 10           6      60.00  4 40.00
 D         8           2      25.00  6 75.00 13           7      53.85  6 46.15
 E         7           3      42.86  4 57.14 11           8      72.73  3 27.27
 All      50          22      44.00 28 56.00 50          27      54.00 23 46.00

From Duncan's explanation,

"The general way to read the code above is 'show the percentage of the values in the current cell relative to the values in all cells with equal x and y.'

The x and y are now taken as expressions; it effectively looks through the formula for places where subsetting occurs, and ignores subsetting by other variables."

The last step is to subset the table, like a matrix, to keep only the desired columns (and/or rows), as demonstrated by the last example in the help file of tabular():

myTable[,c(1,4,5,6,9,10)]

This gives the final result:

          Group                                              
          GroupX                    GroupY                   
                      disease                   disease      
                      1                         1            
 exposure Group Total n       %     Group Total n       %    
 A         5           4      80.00  6           3      50.00
 B        17           5      29.41 10           7      70.00
 C        13           9      69.23 10           4      40.00
 D         8           6      75.00 13           6      46.15
 E         7           4      57.14 11           3      27.27
 All      50          28      56.00 50          23      46.00
Community
  • 1
  • 1
penguinv22
  • 349
  • 5
  • 12
0

I may be barking up the wrong tree here, but in the second question above, are you trying to get the percentages of GroupX and GroupY for each category of exposure? If so then ddply or a similar approach in base R should work.

set.seed(123)
df <- data.frame(exposure = sample(LETTERS[1:5], 100, TRUE),
             Group = sample(c("GroupX","GroupY"), 100, TRUE),
             disease = as.integer(sample(c(0,1), 100, TRUE)))

library(plyr)
foo <- ddply(df,
             .(exposure, Group),
             summarise,
             total = sum(disease))
foo
ddply(foo,
      .(exposure),
      summarise,
      group = Group,
      total = total,
      pct.group = total/sum(total))

That gives the following output:

> foo
   exposure  Group total
1         A GroupX     4
2         A GroupY     4
3         B GroupX     8
4         B GroupY     6
5         C GroupX     6
6         C GroupY     4
7         D GroupX     5
8         D GroupY     4
9         E GroupX     4
10        E GroupY     3
> ddply(foo,
+       .(exposure),
+       summarise,
+       group = Group,
+       total = total,
+       pct.group = total/sum(total))
   exposure  group total pct.group
1         A GroupX     4 0.5000000
2         A GroupY     4 0.5000000
3         B GroupX     8 0.5714286
4         B GroupY     6 0.4285714
5         C GroupX     6 0.6000000
6         C GroupY     4 0.4000000
7         D GroupX     5 0.5555556
8         D GroupY     4 0.4444444
9         E GroupX     4 0.5714286
10        E GroupY     3 0.4285714
SlowLearner
  • 7,907
  • 11
  • 49
  • 80
  • This may be one of those circumstances where keeping the calculations separate from the formatting is the way to go. I do like the formula interface of `tables::tabular` and generally I've found it easy to use. A workflow involving this data prep step and use of this neat [matrix trick](http://stackoverflow.com/questions/9438193/getting-both-column-counts-and-proportions-in-the-same-table-in-r) from @IShouldBuyABoat and then pass to `Hmisc::latex` like [this](http://stackoverflow.com/questions/3444440/counts-percentages-in-xtable-sweave-r-cross-tabulations) might be my best bet. – penguinv22 Jan 04 '14 at 16:55