1

thank you for helping me. I know enough about R to be dangerous, but not enough to be good. I'm trying to create a chi square matrix (exactly like a correlation matrix, but with chi sq) from a dataframe. I've searched extensively (including this page: Chi-square p value matrix in r which didn't help) even trying a loop, and I'm not getting there. Here's what I have so far:

R Studio version 1.2.5033 on Windows 10

data - 14 columns of 3401 rows, I'm only using 9 columns for the chi sq, 8 variables are binary, 1 is categorical

I've gotten the chi sq results I need, but I want to combine it in a matrix. As it is a standard matrix where columns get progressively smaller, the columns are not the same length, so most bind commands won't work. I do have plyr, and I am trying to do it that way, but am failing. Here's the code:

CA <- c(1, 0, 0, 1, 1, 1, 0, 1, 0, 0)
Pos <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
Mon <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 1)
Sc <- c(0, 0, 0, 0, 1, 1, 0, 0, 1, 0)
ood <- c(1, 1, 1, 1, 0, 1, 1, 0, 0, 1)
Eco <- c(1, 2, 4, 6, 7, 3, 2, 5, 7, 7)
Orp <- c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0)
BC <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 1)
SA <- c(0, 0, 1, 1, 1, 1, 1, 0, 0, 1)
MV <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 0) 
Ad <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
YR <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
KC <- c(1, 1, 1, 1, 1, 1, 0, 1, 1, 1)
H <- c(1, 1, 1, 1, 1, 1, 1, 0, 0 ,1)
data <- data.frame(CA, Pos, Mon, Sc, ood, Eco, Orp, BC, SA, MV, Ad, YR, KC, H)

Or, here is the dput() data:

structure(list(CA = c(0, 0, 1, 1, 1, 1, 0, 1, 0), Pos = c(0, 
0, 0, 0, 1, 1, 0, 1, 0), Mon = c(1, 0, 1, 1, 0, 1, 1, 1, 1), 
    Sc = c(0, NA, 0, 0, 1, 1, 1, 1, 0), ood = c(1, 1, 1, 0, 1, 
    0, 1, 1, 1), Eco = c(7, 6, 7, 0, 1, 6, 5, 1, 3), Orp = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0), BC = c(1, 0, 1, 1, 1, 1, 1, 1, 1
    ), SA = c(0, 1, 0, 0, 0, 0, 0, 0, 0), MV = c(15, 13, 16, 
    12, 16, 14, 11, 18, 12), Ad = c(2, 2, 2, 3, 3, 2, 2, 2, 2
    ), YR = c(2, 2, 1, 1, 2, 2, 2, 1, 2), KC = c(2, 2, 1, 1, 
    1, 1, 1, 1, 1), H = c(0, 1, 1, 0, 0, 1, 1, 0, 1)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
    cols = list(CA = structure(list(), class = c("collector_double", 
    "collector")), Pos = structure(list(), class = c("collector_double", 
    "collector")), Mon = structure(list(), class = c("collector_double", 
    "collector")), Sc = structure(list(), class = c("collector_double", 
    "collector")), ood = structure(list(), class = c("collector_double", 
    "collector")), Eco = structure(list(), class = c("collector_double", 
    "collector")), Orp = structure(list(), class = c("collector_double", 
    "collector")), BC = structure(list(), class = c("collector_double", 
    "collector")), SA = structure(list(), class = c("collector_double", 
    "collector")), MV = structure(list(), class = c("collector_double", 
    "collector")), Ad = structure(list(), class = c("collector_double", 
    "collector")), YR = structure(list(), class = c("collector_double", 
    "collector")), KC = structure(list(), class = c("collector_double", 
    "collector")), H = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))
library(tidyverse)
library(plyr)
library(reshape2)

chisqstat <- function(x, y){round(chisq.test(x, y)$statistic, 2)} #chi sq formula, pulling stat
chisqp <- function(x, y){round(chisq.test(x, y)$p.value, 4)} #chi sq formula pulling pvalue

chisqdata <- list(apply(data[c(2:7, 9, 14)], 2, chisqstat, y = data$CA), #chi sq statistic for all values with child abuse
                       apply(data[c(2:7, 9, 14)], 2, chisqp, y = data$CA), #chi sq pvalue for all values with child abuse
                       apply(data[c(3:7, 9, 14)], 2, chisqstat, y = data$Pos),
                       apply(data[c(3:7, 9, 14)], 2, chisqp, y = data$Pos),
                       apply(data[c(4:7, 9, 14)], 2, chisqstat, y = data$Mon),
                       apply(data[c(4:7, 9, 14)], 2, chisqp, y = data$Mon),
                       apply(data[c(5:7, 9, 14)], 2, chisqstat, y = data$Sc),
                       apply(data[c(5:7, 9, 14)], 2, chisqp, y = data$Sc),
                       apply(data[c(6:7, 9, 14)], 2, chisqstat, y = data$ood),
                       apply(data[c(6:7, 9, 14)], 2, chisqp, y = data$ood),
                       apply(data[c(7, 9, 14)], 2, chisqstat, y = data$Eco),
                       apply(data[c(7, 9, 14)], 2, chisqp, y = data$Eco),
                       apply(data[c(9, 14)], 2, chisqstat, y = data$Orp),
                       apply(data[c(9, 14)], 2, chisqp, y = data$Orp),
                       apply(data[c(14)], 2, chisqstat, y = data$SA),
                       apply(data[c(14)], 2, chisqp, y = data$SA))
do.call(rbind.fill.matrix(), chisqdata)

When I do each line as:

as.matrix(apply(data[c(2:7, 9, 14], 2, chisqstat, y = data$CA))

I get exactly what I want with row names so I'm wondering if I need to use as.matrix. I know I can do a full matrix where the info above and below the diagonal is the same, but I find that visually overwhelming. I want to get something like this (this is correlations not chi square):

       CA   Pos   Mon    Sc   ood   Eco
Pos  0.20                              
Mon -0.20  0.60                        
Sc   0.22 -0.22 -0.65                  
ood -0.22  0.22  0.65 -0.52            
Eco  0.00 -0.18 -0.18  0.38 -0.58      
Orp  0.41  0.00 -0.41  0.36  0.09  0.04

I'm sure I'm missing something obvious, so your help is greatly appreciated. Or maybe there is a much nicer way to do this in general, again, thank you.

  • Please add data using `dput` or something that we can copy and use. Also show expected output for the data shared. Read about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and [how to give a reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Nov 08 '20 at 01:38
  • @user10990002 That looked like a helpful [link](https://stackoverflow.com/questions/32732582/chi-square-p-value-matrix-in-r) you found, and seemed like it should work with your data. What was the problem with that solution? – Ben Nov 08 '20 at 20:24
  • @Ben That was the loop I tried. I couldn't get it to work. I googled the error and tried everything I could think of, but it never worked. I just don't understand i well enough to fix it. I get this error: Error in chisq.test(x[, i], x[, j], ) : 'x' and 'y' must have the same length Called from: chisq.test(x[, i], x[, j], ) Error during wrapup: unimplemented type (29) in 'eval' Error: no more error handlers available (recursive errors?); invoking 'abort' restart Error during wrapup: INTEGER() can only be applied to a 'integer', not a 'unknown type #29' – user10990002 Nov 08 '20 at 21:24
  • A full lower triangular matrix for 14 variables would include values between all pairs of variables (columns): 14 * (14 - 1) /2 = 91 values. You are only computing 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 36 of them. – dcarlson Nov 09 '20 at 01:38
  • @Ben Thank you very much, but I'm still getting the same error. The data is in a spreadsheet. So, it must be something to do with the way the data is uploading. I'm using read_csv, and all columns are numeric. – user10990002 Nov 09 '20 at 06:31
  • @dcarlson I only want 9 of the variables, not all 14. That's why I am using this code: data[c(2:7, 9, 14)]. – user10990002 Nov 09 '20 at 06:33
  • @Ben Thank you very much for helping! Hopefully I added the dput data properly. I just did the first 10 rows. I tried it in r studio, and the first 10 are throwing the same error. – user10990002 Nov 09 '20 at 13:16
  • @user10990002 Please see answer below and let me know if this helps. – Ben Nov 09 '20 at 16:30
  • 1
    @Ben Wow is that cool when it works! Thank you very much. Have a great rest of your week. – user10990002 Nov 09 '20 at 18:42

1 Answers1

1

It looks like your dput data added is a tibble, which is probably from how your data was read in:

> class(data)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

While your previous example was a plain data.frame (I called it df).

This plays a role in how your columns are extracted in the chisqmatrix code from here. Specifically:

m[i,j] = chisq.test(x[,i],x[,j],)$p.value

With a simple data.frame, selecting the first column gives you a numeric vector, which is what you want:

> df[,1]
[1] 0 0 1 1 1 1 0 1 0
> class(df[,1])
[1] "numeric"

While with a tibble, you will get another tibble:

> data[,1]
# A tibble: 9 x 1
     CA
  <dbl>
1     0
2     0
3     1
4     1
5     1
6     1
7     0
8     1
9     0
> class(data[,1])
[1] "tbl_df"     "tbl"        "data.frame"

So, two ways to deal with this. One way is to use double brackets (e.g., df[[1]]) which will extract a numeric vector:

> data[[1]]
[1] 0 0 1 1 1 1 0 1 0

Or, you can add drop = TRUE to coerce to a vector (e.g., df[,1, drop = TRUE]):

> data[,1,drop=T]
[1] 0 0 1 1 1 1 0 1 0

Because with a tibble the [ defaults to drop = FALSE:

https://tibble.tidyverse.org/reference/subsetting.html

So, here is the same function to reproduce your desired result:

chisqmatrix <- function(x) {
  names = colnames(x);  num = length(names)
  m = matrix(nrow=num,ncol=num,dimnames=list(names,names))
  for (i in 1:(num-1)) {
    for (j in (i+1):num) {
      #browser()
      m[j,i] = chisq.test(x[, i, drop = TRUE],x[, j, drop = TRUE])$p.value
    }
  }
  return (m)
}

mat <- chisqmatrix(data[c("CA", "Pos", "Mon", "Sc", "ood", "Eco")])
mat[-1, -ncol(mat)]

           CA       Pos       Mon        Sc       ood
Pos 0.2356799        NA        NA        NA        NA
Mon 1.0000000 1.0000000        NA        NA        NA
Sc  1.0000000 0.1441270 1.0000000        NA        NA
ood 0.5303348 1.0000000 1.0000000 1.0000000        NA
Eco 0.4220127 0.2399069 0.6669878 0.1562356 0.2959326

Note: I removed "Orp" as this will give an error from the sample data where there is only a single level; so you can replace with:

chisqmatrix(data[c("CA", "Pos", "Mon", "Sc", "ood", "Eco", "Orp")])

if using your full data.

Please let me know if this works for you.

Ben
  • 28,684
  • 5
  • 23
  • 45