Automate Chi-square across categories and columns

Question

I have a survey dataframe containing several questions (columns) coded as 1=agree/0=disagree. Respondents (rows) are categorized according to metrics "age" ("young","middle","old"), "region" ("East","Mid","West"), etc. There are around 30 categories in total (3 ages, 3 regions, 2 genders, 11 occupations, etc.). Within each metric, categories are non-overlapping and of different sizes.

This simulates a cut-down version of the dataset:

n <- 400
set.seed(1)
data <- data.frame(
  age    = sample(c('young', 'middle', 'old'), n, replace = TRUE),
  region = sample(c('East', 'Mid', 'West'), n, replace = TRUE),
  gender = sample(c('M', 'F'), n, replace = TRUE),
  Q15a   = sample(c(0, 1), n, replace = TRUE),
  Q15b   = sample(c(0, 1), n, replace = TRUE)
)

I can use Chi-square to test if the responses in, say, the West differ significantly from the total sample, for Q15a, with:

# refer to the column as data$Q15a rather than relying on attach(data)
chisq.test(table(subset(data, region == 'West')$Q15a), p = table(data$Q15a), rescale.p = TRUE)

I want to test all categories against the total sample for Q15a, and then for ~20 other questions. As there are around 30 tests per question, I want to find a way (efficient or otherwise) to automate this, but am struggling to see how to get R to do this itself or how to write a loop to cycle through the categories. I've searched[1], and got sidetracked into pairwise comparison testing with pairwise.prop.test(), but haven't found anything that really answers this yet.

[1] similar but not duplicate questions (both are column-wise tests):

Using loops to do Chi-Square Test in R

Chi Square Analysis using for loop in R

I think it would be best if you provided a minimal reproducible example. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Roman Luštrik, Mar 06 '14 at 13:59

score 2 · Answer 1 · answered Mar 06 '14 at 16:10

How about this?

# find all question columns containing Q, your "subset" may differ
nms <- names(data)
nms <- nms[grepl("Q", nms)]

result <- sapply(nms, FUN = function(x, data) {
  qinq <- data[, c("region", x)]
  by(data = qinq, INDICES = data$region, FUN = function(y, qinq) {
    chisq.test(table(y[, x]), p =  table(qinq[, x]), rescale.p = TRUE)
  }, qinq = qinq)
}, data = data, simplify = FALSE)

$Q15a
data$region: East

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0.7494, df = 1, p-value = 0.3867

--------------------------------------------------------------------------------------------- 
data$region: Mid

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0.2249, df = 1, p-value = 0.6353

--------------------------------------------------------------------------------------------- 
data$region: West

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 1.5877, df = 1, p-value = 0.2077


$Q15b
data$region: East

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0.0697, df = 1, p-value = 0.7918

--------------------------------------------------------------------------------------------- 
data$region: Mid

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0, df = 1, p-value = 0.9987

--------------------------------------------------------------------------------------------- 
data$region: West

    Chi-squared test for given probabilities

data:  table(y[, x])
X-squared = 0.056, df = 1, p-value = 0.8129

You can extract anything you want. Here's how you would extract a p.value.

lapply(result, FUN = function(x) lapply(x, "[", "p.value"))

$Q15a
$Q15a$East
$Q15a$East$p.value
[1] 0.3866613


$Q15a$Mid
$Q15a$Mid$p.value
[1] 0.6353457


$Q15a$West
$Q15a$West$p.value
[1] 0.2076507



$Q15b
$Q15b$East
$Q15b$East$p.value
[1] 0.7918426


$Q15b$Mid
$Q15b$Mid$p.value
[1] 0.9986924


$Q15b$West
$Q15b$West$p.value
[1] 0.8128969
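
If a compact table is easier to read than the nested list, the p-values can be collapsed into a matrix with two nested sapply() calls. This is only a sketch on the simulated data from the question; the names pvals, per_question and test are made up for illustration, and the exact numbers depend on your R version's random number generator, so they may not match the output shown above.

```r
# Simulated data from the question
n <- 400
set.seed(1)
data <- data.frame(
  age    = sample(c("young", "middle", "old"), n, replace = TRUE),
  region = sample(c("East", "Mid", "West"), n, replace = TRUE),
  gender = sample(c("M", "F"), n, replace = TRUE),
  Q15a   = sample(c(0, 1), n, replace = TRUE),
  Q15b   = sample(c(0, 1), n, replace = TRUE)
)

# Same tests as above: each region against the whole sample, per question
nms <- grep("Q", names(data), value = TRUE)
result <- sapply(nms, function(x) {
  by(data[, c("region", x)], data$region, function(y) {
    chisq.test(table(y[, x]), p = table(data[, x]), rescale.p = TRUE)
  })
}, simplify = FALSE)

# Rows are regions, columns are questions (pvals is a hypothetical name)
pvals <- sapply(result, function(per_question) {
  sapply(per_question, function(test) test$p.value)
})
pvals
```

From there something like round(pvals, 3) or which(pvals < 0.05, arr.ind = TRUE) picks out the cells of interest.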

Happy formatting.
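
The code above only loops over region. To cover the other metrics as well, one possible approach is to build a table of every question/metric/category combination and run the same test for each row. This is a rough sketch, not a tested solution for your real data: the vectors metrics and questions and the data frame combos are names I made up, and I am assuming the question columns all start with "Q". Fixing the factor levels before tabulating is an extra precaution so that a category in which nobody gave one of the answers still lines up with the whole-sample table.

```r
# Simulated data from the question
n <- 400
set.seed(1)
data <- data.frame(
  age    = sample(c("young", "middle", "old"), n, replace = TRUE),
  region = sample(c("East", "Mid", "West"), n, replace = TRUE),
  gender = sample(c("M", "F"), n, replace = TRUE),
  Q15a   = sample(c(0, 1), n, replace = TRUE),
  Q15b   = sample(c(0, 1), n, replace = TRUE)
)

metrics   <- c("age", "region", "gender")           # assumed grouping columns
questions <- grep("^Q", names(data), value = TRUE)  # assumed question columns

# One row per (question, metric, category) combination
combos <- do.call(rbind, lapply(metrics, function(m) {
  expand.grid(question = questions, metric = m,
              category = sort(unique(data[[m]])),
              stringsAsFactors = FALSE)
}))

# Test each category's responses against the whole-sample proportions
combos$p.value <- mapply(function(q, m, k) {
  lev <- sort(unique(data[[q]]))  # shared levels keep empty cells aligned
  obs <- table(factor(data[[q]][data[[m]] == k], levels = lev))
  ref <- table(factor(data[[q]], levels = lev))
  chisq.test(obs, p = ref, rescale.p = TRUE)$p.value
}, combos$question, combos$metric, combos$category, USE.NAMES = FALSE)

combos
```

The result is a plain data frame, so it can be sorted or filtered on p.value like any other.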

Roman, sorry for the delay in replying. I haven't been able to dissect this but it certainly works on the df I gave - excellent job. It doesn't work on my real df but I'm trying to deconstruct your solution to work out why (so far without luck as I haven't understood how your code works yet!). — Graham Jones, Mar 11 '14 at 14:50
@GrahamJones insert `browser()` inside the first or second anonymous function and step through the code by hand, inspecting objects and statements along the way. http://stackoverflow.com/questions/13663043/exit-current-browser-return-one-level may help. — Roman Luštrik, Mar 11 '14 at 14:55

score 1 · Answer 2 · answered Feb 04 '16 at 07:18

You could also use the chisq.desc() function from the EnQuireR package. It worked fine for me. The package is quite old, has had no updates for a long time, and has little support available, so a few of its functions did not work for me, but I found chisq.desc() useful. It colors the cells of a table containing the results of the Chi-square tests, crossing all the selected categorical variables, according to a selected threshold. I am unable to comment, so I am writing this as an answer.
