Use R to create a large multiple column frequency table

Question

I am having a hard time doing this efficiently and apologize if it's a basic question. I need to make a contingency table with N and percent to summarize relationships between a large number of binary variables, simply in terms of freq and percent, with no other summary statistics.

Specifically, it is to summarize the number of patients that have sample type X and clinical outcome Y. A patient can have any number of outcomes and any number of samples, i.e. each variable is non-mutually exclusive and independent.

I would like to put all the outcomes (Death, ICU admission, leg fell off...) as columns, and all the sample types (serum, urine, etc...) as rows. I would only need to list the frequency and percentage of "positive" responses, i.e. N and % of patients who died and who had a urine sample.

Are there any packages out there that could help with this kind of table? Everything I am finding is good for doing a nice 1xN variable contingency table. I wouldn't mind making a separate table for each outcome, if I can somehow extract a column of that output and bind them together into a master table to rule them all. Another idea is to somehow make a frequency table of two mChoice (Hmisc package) variables. I do not know if either of those two strategies are possible.

Any ideas?

What I'm looking for is something like this:

+-------------+--------+---------+
|             | Death  | ICU     |
|             | (N=10) | (N=50)  |
+-------------+--------+---------+
|Serum (N=50) |5 (50%) | 30 (60%)|
+-------------+--------+---------+
|Urine (N=40) |10(100%)| 7 (14%) |
+-------------+--------+---------+
|Brain (N=25) |6 (60%) | 15 (30%)|
+-------------+--------+---------+
|Kidney (N=50)|7 (70%) | 40 (80%)|
+-------------+--------+---------+

Edit to include sample data:

set.seed(1)
death<-runif(1000)<=.75
ICU<-runif(1000)<=.63
serum<-runif(1000)<=.80
urine<-runif(1000)<=.77
brain<-runif(1000)<=.92
kidney<-runif(1000)<=.22
df<-as.data.frame(cbind((1:1000),death,ICU,serum,urine,brain,kidney))

Can you give us some sample data? http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Roman Luštrik, Mar 19 '13 at 14:38
Some example data would be: set.seed(1) death<-runif(1000)<=.75 ICU<-runif(1000)<=.63 serum<-runif(1000)<=.80 urine<-runif(1000)<=.77 brain<-runif(1000)<=.92 kidney<-runif(1000)<=.22 df<-as.data.frame(cbind((1:1000),death,ICU,serum,urine,brain,kidney)) where the resulting V1 is the patient ID number. Thanks! — micturalgia, Mar 19 '13 at 17:29
OK, after playing around with this for a while I'm not sure I understand how it's supposed to work. Looking at the example table I note that the column N does not add up to the sum of the cell n's in that column. For example, there are N=10 deaths, but 10 Deaths with Urine and 5 with Serum. How can there be more deaths than there are? What does N=10 Deaths actually mean? — Ista, Mar 20 '13 at 12:36
Yes, the issue is exactly that - the options are not mutually exclusive. A given patient could have all the outcomes, some, or none. Being transferred to the ICU does not preclude the patient from dying or having renal failure, etc. By the same token, a given patient could have a serum sample, a urine sample, both, neither, etc. So if there are N=10 deaths in total, there could be up to but not more than 10 urine samples with death, 10 serum samples with death, etc. The challenge is to show e.g. how many serum samples were collected from a patient who had renal failure. Is this clearer? — micturalgia, Mar 20 '13 at 19:10
OK, it's starting to sink in! @user2186883 please check my edited answer below and see if that does what you are looking for. — Ista, Mar 20 '13 at 19:23

score 2 · Answer 1 · answered Oct 27 '15 at 09:41

Here is a simple and fast solution using the data.table package.

library(data.table)

# convert your data frame to data.table
  setDT(df)


# create the output for serum
  serum <- df[serum==1, .(test="serum",
                          test.N = .N, 
                          death.count = sum(death),
                          death.N = sum(df$death),
                          death.prop=(sum(death)/sum(df$death))*100,
                          icu.count = sum(ICU),
                          icu.N = sum(df$ICU),
                          icu.prop=(sum(ICU)/sum(df$ICU))*100),
                          by=.(serum)]

# create the output for kidney
  kidney<- df[kidney==1, .(test="kidney",
                          test.N = .N, 
                          death.count = sum(death),
                          death.N = sum(df$death),
                          death.prop=(sum(death)/sum(df$death))*100,
                          icu.count = sum(ICU),
                          icu.N = sum(df$ICU),
                          icu.prop=(sum(ICU)/sum(df$ICU))*100),
                          by=.(kidney)]

# Bind outputs into a table
  table <- rbind( serum[,2:9,with = FALSE],
                  kidney[,2:9,with = FALSE])

table
>      test test.N death.count death.N death.prop icu.count icu.N icu.prop
> 1:  serum    806         602     752   80.05319       511   632 80.85443
> 2: kidney    190         141     752   18.75000       128   632 20.25316

score 1 · Accepted Answer · edited May 23 '17 at 11:46

EDIT: This is a revised answer offered after discussing the problem with the original poster. An older answer that does not solve the problem at hand is retained below for posterity.

This answer is not short nor concise, and I do hope there is a cleaner way. But the following will work:

## generate example data
set.seed(1)
death<-runif(1000)<=.75
ICU<-runif(1000)<=.63
serum<-runif(1000)<=.80
urine<-runif(1000)<=.77
brain<-runif(1000)<=.92
kidney<-runif(1000)<=.22
df<-as.data.frame(cbind((1:1000),death,ICU,serum,urine,brain,kidney))

## load up our data manipulation workhorses
library(reshape2)
library(plyr)

## save typing by saving row and column var names
row.vars <- c("serum", "urine", "brain", "kidney")
col.vars <- c("death", "ICU")

## melt data so we have death/icu in a column
dat.m <- melt(df, measure.vars = row.vars)

## get rid of rows with death==0 and ICU==0
dat.m <- dat.m[dat.m$value == 1, ]

## for each of death and icu calculate proportion of 1's
tab <- ddply(dat.m, "variable", function(DF) {
  colwise(function(x) length(x[x==1]))(DF[col.vars])
})

## calculate overall proportions for row and column vars
row.nums <- sapply(df[row.vars], function(x) length(x[x==1]))
col.nums <- sapply(df[col.vars], function(x) length(x[x==1]))

## paste row and column counts into row and column names
rownames(tab) <- paste(tab$variable, " (N=", row.nums, ")", sep="")
tab$variable <- NULL
colnames(tab) <- paste(names(tab), " (N=", col.nums, ")", sep="")

## calculate cell proportions and paste them in one column at a time
tab[[1]] <- paste(tab[[1]],
                  " (",
                  round(100*(tab[[1]]/col.nums[[1]]), digits=2),
                  "%)",
                  sep="")
tab[[2]] <- paste(tab[[2]],
                  " (",
                  round(100*(tab[[2]]/col.nums[[2]]),
                        digits=2),
                  "%)",
                  sep="")

Now we can

## behold the fruits of our labor
tab
               death (N=752)  ICU (N=632)
serum (N=806)   602 (80.05%) 511 (80.85%)
urine (N=739)   556 (73.94%)  462 (73.1%)
brain (N=910)   684 (90.96%) 576 (91.14%)
kidney (N=190)  141 (18.75%) 128 (20.25%)

OLD ANSWER (does not solve problem at hand, but may be useful for related tasks)

This is one of those things that seems like it should be easy, but somehow isn't.

There is an existing question that addresses this once you have two columns ready to tabulate. That part is easy:

# function to genderate example data
mkdat <- function() factor(sample(letters[1:4], 10, replace=TRUE), levels=letters[1:4])

# make example data
set.seed(10)
dat <- data.frame(id = 1:10, var1 = mkdat(), var2=mkdat(), var3=mkdat())

# use reshape2 package to reshape from wide to long form
library(reshape2)
dat.m <- melt(dat, id.vars="id")
dat.m$value <- factor(dat.m$value)

Now the cross tab of dat.m$variable and dat.m$value give the correct cells. You can refer to the linked question above on how to proceed from there to get both counts and percents in a table, or you can use this method:

# tabulate
library(plyr)
tab <- ddply(dat.m, "variable",
             function(DF) {
               # get counts with table
               count <- table(DF$value)
               # convert counts to percent
               prop <- paste(prop.table(count)*100, "%", sep="")
               # combine count and percent
               cp <- paste(count, " (", prop, ")", sep="")
               # re-attach the names
               names(cp) <- levels(DF$value)
               return(cp)
             })

# get row n
tab.r <- table(dat.m$variable)
# get column n
tab.c <- table(dat.m$value)
# paste row and column n into row and column names
colnames(tab) <- paste(colnames(tab), " (n = ", tab.c, ")", sep="")
rownames(tab) <- paste(tab$variable, " (n = ", tab.r, ")", sep="")
tab$variable <- NULL

# works, but that was way too much effort.
print(tab)

It has to be admitted that this is a lot of work for a simple count-and-proportion table. I'll be delighted if someone comes along with a simpler way to do it.

Thanks a lot for the quick reply. Unfortunately since I forgot to put in sample data (edited now to include), your response unfortunately doesn't quite get at my question. Each column and each row needs to be a different binary variable that is independent of the other columns. I.e. the sum of each row or column is not equal to 100%. A given individual (V1 in my example data) could have a positive response for one or more, or none, of the "outcome" variables, and the same for the "sample variables". The table should show how many patients with any given combo of outcome and sample. Thanks! — micturalgia, Mar 19 '13 at 17:43
OK sorry about that @user2186883. I'll try to take another look at this later. — Ista, Mar 19 '13 at 18:28
No problem and thanks for giving it a second try, @Ista! In actuality, your response helped me with a different problem I was having. — micturalgia, Mar 19 '13 at 18:56
I took another stab at it. @user2186883 can you take a look and see if this is what you're after? — Ista, Mar 20 '13 at 17:12
Hi @Ista, thanks so much. This is definitely what i was after, and I really appreciate the effort. You saved me a lot of tedious cutting and pasting. — micturalgia, Mar 20 '13 at 21:45

Use R to create a large multiple column frequency table

2 Answers2