Creating a contingency table using multiple columns in a data frame in R

Question

I have a data frame which looks like this:

structure(list(ab = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1), bc = c(1, 
1, 1, 1, 0, 0, 0, 1, 0, 1), de = c(0, 0, 1, 1, 1, 0, 1, 1, 0, 
1), cl = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 2)), .Names = c("ab", "bc", 
"de", "cl"), row.names = c(NA, -10L), class = "data.frame")

The column cl indicates cluster membership, and the variables ab, bc and de hold binary answers, where 1 means yes and 0 means no.

I am trying to cross-tabulate the cluster column against all the other columns in the data frame (ab, bc and de), with the clusters as the columns of the result and one row per variable.

I tried the following code:

with(newdf, tapply(newdf[,c(3)], cl, sum))

This cross-tabulates only one column at a time. My data frame has 1600+ columns plus 1 cluster column. Can someone help?

It seems you could try with `aggregate`; `aggregate(. ~ cl, newdf, sum)`? — alexis_laz, Oct 31 '15 at 22:38
alexis_laz, thank you for such a simple solution. This is really nice, but since my current dataset has 1600+ variables, it becomes a bit difficult to read all of them in one go. — Apricot, Nov 01 '15 at 03:42
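
The readability concern in the comment above can be eased by transposing the `aggregate` result, so that each of the many variables becomes a row instead of a column. The following is a minimal sketch, assuming the question's data frame is named `newdf` as in the asker's own code; the names `agg` and `wide` are chosen here for illustration only.

```r
# Data from the question (rebuilt here so the example is self-contained)
newdf <- data.frame(
  ab = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1),
  bc = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1),
  de = c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1),
  cl = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 2)
)

# Sum every non-cluster column within each cluster: one row per cluster
agg <- aggregate(. ~ cl, data = newdf, FUN = sum)

# Drop the cl column and transpose: one row per variable, one column per cluster
wide <- t(agg[, names(agg) != "cl", drop = FALSE])
colnames(wide) <- agg$cl

wide
```

With 1600+ variables the transposed result is tall rather than wide, so it can be inspected with `head(wide)` or filtered by row name.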

score 8 · Answer 1 · answered Oct 31 '15 at 19:29

In base R:

t(sapply(data[,1:3],function(x) tapply(x,data[,4],sum)))
#   1 2 3
#ab 1 3 2
#bc 2 3 1
#de 2 3 1
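
The answer above hard-codes the column positions `1:3` and `4`. For a data frame with 1600+ variables, one possible variation is to select every column except the cluster column by name and let base R's `rowsum()` do the grouped sums. This is a sketch rather than the answerer's method; it assumes the data frame is called `df` and the cluster column is called `cl`, and the names `vars`, `by_cluster` and `result` are illustrative.

```r
# Data from the question (rebuilt here so the example is self-contained)
df <- data.frame(
  ab = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1),
  bc = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1),
  de = c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1),
  cl = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 2)
)

# Every column except the cluster column, however many there are
vars <- setdiff(names(df), "cl")

# rowsum() sums each column within groups: one row per cluster
by_cluster <- rowsum(df[vars], group = df$cl)

# Transpose so that variables are rows and clusters are columns
result <- t(by_cluster)

result
```

Because `rowsum()` handles all columns in a single call, this avoids looping over the variables with `sapply`.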

nicola


score 7 · Answer 2 · answered Oct 31 '15 at 19:23

One way using dplyr would be:

library(dplyr)
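df %>%
  # group by the variable cl
  group_by(cl) %>%
  # sum every remaining column; across() needs dplyr >= 1.0.0 and
  # replaces the deprecated summarize_each(funs(sum))
  summarise(across(everything(), sum)) %>%
  # select the three needed columns
  select(ab, bc, de) %>%
  # transpose the result
  t()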

Output:

   [,1] [,2] [,3]
ab    1    3    2
bc    2    3    1
de    2    3    1

score 6 · Accepted Answer · answered Oct 31 '15 at 19:24

Your data is in a half-long, half-wide format, and you want it in a fully wide format. This is easiest if we first convert it to a fully long format:

library(reshape2)
df_long = melt(df, id.vars = "cl")
head(df_long)
#    cl variable value
# 1   1       ab     0
# 2   2       ab     1
# 3   3       ab     1
# 4   1       ab     1
# 5   2       ab     1
# 6   3       ab     0

Then we can turn it into a wide format, using sum as the aggregating function:

dcast(df_long, variable ~ cl, value.var = "value", fun.aggregate = sum)
#   variable 1 2 3
# 1       ab 1 3 2
# 2       bc 2 3 1
# 3       de 2 3 1

score 2 · Answer 4 · answered Oct 31 '15 at 19:37

You can also combine tidyr::gather (or reshape2::melt) with xtabs to get your contingency table:

library(tidyr)
xtabs(value ~ key + cl, data = gather(df, key, value, -cl))
##     cl
## key  1 2 3
##   ab 1 3 2
##   bc 2 3 1
##   de 2 3 1

If you prefer to use the pipe:

df %>%
  gather(key, value, -cl) %>%
  xtabs(value ~ key + cl, data = .)
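
The same reshape-then-tabulate idea can also be sketched without any extra packages, using base R's `stack()` in place of `gather`. This is an illustrative variation rather than the answerer's code; it assumes the data frame is called `df` with a cluster column `cl`, and the names `vars`, `long` and `tab` are chosen here for illustration.

```r
# Data from the question (rebuilt here so the example is self-contained)
df <- data.frame(
  ab = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1),
  bc = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1),
  de = c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1),
  cl = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 2)
)

# Every column except the cluster column
vars <- setdiff(names(df), "cl")

# stack() produces a long data frame with columns `values` and `ind`
long <- stack(df[vars])

# Repeat the cluster labels once per stacked variable so rows line up
long$cl <- rep(df$cl, times = length(vars))

# Sum `values` for each variable/cluster combination
tab <- xtabs(values ~ ind + cl, data = long)

tab
```

Here `ind` plays the role of `key` and `values` the role of `value` in the tidyr version.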

score 0 · Answer 5 · answered Jul 15 '20 at 15:41

Just to update using tidyr's pivot_longer (which supersedes gather), following the code dickoa wrote:

library(dplyr)
library(tidyr)  # pivot_longer() is provided by tidyr

df %>%
  pivot_longer(cols = ab:de,
               names_to = "key",
               values_to = "value") %>%
  xtabs(value ~ key + cl, data = .)

Zoë Turner
