
I have a contingency table, e.g. the built-in Titanic dataset, and I want a way to drop a variable and sum the counts over its levels, in effect projecting the data down onto a lower-dimensional space.

For example, looking at just one 2-d slice of the table:

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20

If we were to drop the Sex variable, I would want to end up with a 1-d contingency table that looked like:

Class  Freq
  1st   197
  2nd    94
  3rd   151
  Crew  212

My actual use case is an N-dimensional table from which I want to be able to construct all N 1-way and N*(N-1)/2 2-way tables. It feels like there should be a simple way to do this.

EDIT: Note that this is not a duplicate of the question this has been linked with, as that is referring to data tables, not contingency tables. The solution here is to convert the contingency table to a data table, then use xtabs to get back to a contingency table. The referenced solution only deals with the case of starting with a data table and wanting to end up with a data table.

user2711915
  • you should provide sample data or specify which built-in Titanic dataset you are using. (There's one in `datasets`, another in `rpart.plot` that is similar to the Kaggle one) – C8H10N4O2 Sep 22 '16 at 18:52
  • As I said, the example I was using was the built in Titanic {datasets}. The one you get by just typing Titanic. I can't even find the rpart.plot version. Is that a contingency table? Either way, the content of the data doesn't matter, just how to manipulate it. – user2711915 Sep 23 '16 at 11:37

1 Answer

data(Titanic)
library(dplyr)

as.data.frame(Titanic) %>% group_by(Class) %>% summarise(n=sum(Freq))

# Class     n
# (fctr) (dbl)
# 1    1st   325
# 2    2nd   285
# 3    3rd   706
# 4   Crew   885

Or with data.table:

library(data.table)
as.data.table(Titanic)[, .(n = sum(N)), keyby=Class]

You can make a vector of dimension names and then loop over get(dimname) in dplyr or data.table to produce the 1-way or 2-way frequencies.

example:

dims <- c('Class','Sex','Age')
dt <- as.data.table(Titanic)
for(dim in dims)
  print(dt[, .(n = sum(N)), keyby = get(dim)])

Note that get is one way of passing a variable name to do the frequency tables programmatically.

To do a 2-way table in data.table, you can use dcast:

dcast.data.table(dt, Age ~ Class, value.var='N', fun.aggregate=sum)
#      Age 1st 2nd 3rd Crew
# 1: Adult 319 261 627  885
# 2: Child   6  24  79    0

To produce multiple 2-way tables with dcast, you would need to build the formula programmatically, e.g. formula = as.formula(paste(v1,v2,sep='~'))
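As a rough sketch of that idea, the loop below builds one dcast formula per pair of variables and collects the results in a list. The names `pairs` and `two_way`, and the `_by_` naming scheme, are only illustrative choices, not anything data.table requires.

```r
library(data.table)

dt   <- as.data.table(Titanic)                  # long format; counts are in column N
dims <- c("Class", "Sex", "Age", "Survived")

# Every unordered pair of variables, i.e. N*(N-1)/2 pairs
pairs <- combn(dims, 2, simplify = FALSE)

two_way <- lapply(pairs, function(p) {
  # Build e.g. Class ~ Sex from the two variable names
  f <- as.formula(paste(p[1], p[2], sep = " ~ "))
  dcast(dt, f, value.var = "N", fun.aggregate = sum)
})

# Label each result by its pair (illustrative naming scheme)
names(two_way) <- vapply(pairs, paste, character(1), collapse = "_by_")

two_way[["Class_by_Age"]]
```

Each element of `two_way` is a data.table with the first variable of the pair in the rows and the levels of the second as columns.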

Since data.table syntax takes some getting used to, if you want to stay inside the 'tidyverse' for 2-way tables you can just do:

data(Titanic)
library('dplyr')
library('tidyr')

as.data.frame(Titanic) %>% 
  group_by(Age,Class) %>% 
  summarise(n=sum(Freq)) %>%
  spread(Class, n)

#      Age   1st   2nd   3rd  Crew
#   (fctr) (dbl) (dbl) (dbl) (dbl)
# 1  Child     6    24    79     0
# 2  Adult   319   261   627   885
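If the result should stay a contingency table the whole way through, base R's margin.table() may also be worth a look. The following is a sketch of that alternative rather than part of the approach above; the list names `one_way` and `two_way` are made up for illustration, and the last line assumes the 2-d slice shown in the question is the one for adult survivors.

```r
# Variable names of the table, in dimension order
vars <- names(dimnames(Titanic))

# All N 1-way tables: sum over every dimension except i
one_way <- lapply(seq_along(vars), function(i) margin.table(Titanic, i))
names(one_way) <- vars

# All N*(N-1)/2 2-way tables: keep a pair of dimensions, sum over the rest
idx <- combn(seq_along(vars), 2, simplify = FALSE)
two_way <- lapply(idx, function(ix) margin.table(Titanic, ix))
names(two_way) <- vapply(idx, function(ix) paste(vars[ix], collapse = "_"),
                         character(1))

one_way$Class
two_way$Class_Age

# Dropping Sex from a single 2-d slice (assumed here: adults who survived)
margin.table(Titanic[, , "Adult", "Yes"], 1)
```

No conversion to a data frame is needed here, since margin.table() sums directly over the dimensions that are not kept.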
C8H10N4O2
  • This is excellent, thank you. I tried the data.table version, and it lets you do exactly what I want. It seems you don't even need the get() in the keyby argument; it will work with a string name. For example: `as.data.table(Titanic)[, (n = sum(N)), keyby = c("Class", "Age")]`, and the order in which the elements of keyby are specified determines the order in which the columns are created in the output. – user2711915 Sep 23 '16 at 11:43
  • Even though this solution is incomplete, I had marked this as the answer, assuming I could suggest the remaining portion as an edit. As you have rejected the edit, leaving this answer incomplete, I have unmarked it as the answer. So, an example including the complete transformation back to a contingency table again: `myData = as.data.table(Titanic)[, (n = sum(N)), keyby = c("Age", "Class")]` `myContingencyTable = xtabs(as.formula("V1 ~ Age + Class"), data = myData)` and myContingencyTable is now a contingency table containing only the variables specified. – user2711915 Sep 23 '16 at 15:07
  • @user2711915 please see edit -- using `xtabs` is an unnecessary step if you are already using `as.data.table` – C8H10N4O2 Sep 23 '16 at 15:30
  • Ok, that completes the solution, thank you. I'm slightly confused by the fact that the formula for dcast is `Age ~ Class`, while for xtabs it is `V1 ~ Age + Class`, but either is simple to build programmatically. – user2711915 Sep 23 '16 at 15:52
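
On the formula difference raised in the last comment: a dcast formula describes layout (row variables ~ column variables), with the counts named separately through value.var, whereas an xtabs formula puts the count column on the left and the classifying variables on the right. A minimal base-R sketch of the xtabs round trip follows; `collapse_table` is a hypothetical helper name, and it assumes the counts sit in a column called Freq, as they do after as.data.frame(Titanic).

```r
df <- as.data.frame(Titanic)   # long format; counts are in column Freq

# Hypothetical helper: keep only `vars`, summing the counts over all other variables
collapse_table <- function(data, vars, count = "Freq") {
  # reformulate() builds e.g. Freq ~ Age + Class from character strings
  xtabs(reformulate(vars, response = count), data = data)
}

age_class <- collapse_table(df, c("Age", "Class"))
age_class
```

The result is again a contingency table, so it can be passed on to anything that expects one.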