Rfast segmentation fault on independence test

Question

I am having trouble using the G2-test function (g2Test) of the Rfast package in R: it crashes with a segmentation fault even though the input parameters seem correct to me.

More specifically, I am able to run the example code from the manual page:

nvalues <- 3
nvars <- 10
nsamples <- 5000
data <- matrix( sample( 0:(nvalues - 1), nvars * nsamples, replace = TRUE ), nsamples, nvars )
dc <- rep(nvalues, nvars)

res<-g2Test( data, 1, 2, 3, c(3, 3, 3) )

But I'm not able to make it run on my data. The function g2Test takes as input a matrix of numbers, three arguments giving column indices (the two columns to test and the column or columns to condition on; in the example we are studying the dependence of the first on the second conditioned on the third), and a vector with the number of unique values per column.
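For reference, here is a minimal base-R sketch of the quantity a G2 test is built on, in the unconditional case only. The helper name g2_by_hand is hypothetical and not part of Rfast, and this is not a claim about how g2Test is implemented internally; it only illustrates the textbook statistic 2 * sum(O * log(O / E)) with (rows - 1) * (cols - 1) degrees of freedom.

```r
# Hypothetical helper (not part of Rfast): unconditional G^2 statistic
# for two categorical vectors, computed from the contingency table.
g2_by_hand <- function(x, y) {
  observed <- table(x, y)
  # expected counts under independence: row total * column total / n
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  nonzero <- observed > 0  # empty cells contribute nothing to the sum
  statistic <- 2 * sum(observed[nonzero] * log(observed[nonzero] / expected[nonzero]))
  df <- (nrow(observed) - 1) * (ncol(observed) - 1)
  list(statistic = statistic, df = df)
}

# toy data, coded 0..2 like the manual example
set.seed(1)
x <- sample(0:2, 500, replace = TRUE)
y <- sample(0:2, 500, replace = TRUE)
res_hand <- g2_by_hand(x, y)
```

The conditional version sums this statistic over every combination of values of the conditioning columns, which is why the function needs to know how many distinct values each column has.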

My code follows the same principles, reading data from the ALARM csv file:

library(readr)
library(Rfast)

# open the file
path <-  "datasets/alarm.csv"
dataset <- read.csv(path)
# search for the indexes of the column I'm interested in and the amount of unique values per column
c1 <- "PVS"
c2 <- "ACO2"
s <- c("VALV", "VLNG", "VTUB",   "VMCH")
n <- colnames(dataset) 
col_c1 <- match(c1, n)
col_c2 <- match(c2, n)
cols_c3 <- c()
uni <- c(length(unique(dataset[c1])[[1]])[[1]],length(unique(dataset[c2])[[1]])[[1]])
if (!s[1]=="()"){
 for(v in s){
   idx <- match(v, n)
   cols_c3 <- append(cols_c3,idx)
   uni <- append(uni,length(unique(dataset[v])[[1]])[[1]])
 }
}
# transforming the str DataFrame into an integer matrix
for (nn in n){
  dataset[nn] <- unclass(as.factor(dataset[nn][[1]]))
}
ds <- as.matrix(dataset)
colnames(ds) <- NULL

# running the G2 test
res <- g2Test(ds, col_c1, col_c2, cols_c3, uni)

But it results in a segmentation fault:

 *** caught segfault ***
address 0x1f103f96a, cause 'memory not mapped'

Traceback:
 1: g2Test(ds, col_c1, col_c2, cols_c3, uni)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

The same happens if I condition on just one variable and not on multiple ones.

I really don't understand why this happens, since it seems to me that my case is the same as the example in the reference manual, just with different data. I would really appreciate any help debugging this issue; please tell me if I need to provide further information.

It looks like you're fairly new to SO; welcome to the community! If you want great answers quickly, it's best to make your question reproducible. This includes sample data (e.g., data.frame(x=...,y=...) like the output from dput(head(dataObject))). Check it out: [making R reproducible questions](https://stackoverflow.com/q/5963269). The chances are that it has something to do with your data, so it's pretty difficult to help without it. — Kat, Jan 17 '22 at 05:04
Thanks for the tip! I already think it is reproducible since there is the code and there is the link to the input file as well, you probably missed that part while reading the question. I'm not sure it is a data-related problem, I think it is a problem due to the linkage between R and the cpp libraries that perform the computation but I have no way to solve this issue by myself — DaSim, Jan 17 '22 at 09:07

score 1 · Accepted Answer · answered Jan 17 '22 at 15:52

First, I'm sorry that I missed that you had originally included your data!

Alright, I wish I had realized this sooner (as you will, as well...). The columns have to be consecutive and the values must start at zero. So what does that mean? You have to rearrange the columns so that col_c1 is the first column, col_c2 is the second column, and so on. You also have to subtract one from all values (since the lowest value after the factor conversion is 1).
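Both requirements can be sketched compactly on toy data in base R. The helper prepare_for_g2 and the toy data frame are hypothetical illustrations, not part of Rfast or of the ALARM dataset; the sketch only recodes and reorders, and does not call g2Test.

```r
# Hypothetical helper: recode a data frame of categorical columns to a
# zero-based numeric matrix, with the columns of interest moved to the front.
prepare_for_g2 <- function(df, first_cols) {
  stopifnot(all(first_cols %in% colnames(df)))
  # factor codes start at 1, so subtract 1 to make the minimum zero
  codes <- sapply(df, function(col) as.integer(as.factor(col)) - 1L)
  rest <- setdiff(colnames(df), first_cols)
  m <- codes[, c(first_cols, rest), drop = FALSE]
  storage.mode(m) <- "double"
  # number of distinct values per column, in the new column order
  dc <- apply(m, 2, function(col) length(unique(col)))
  list(data = unname(m), dc = unname(dc))
}

toy <- data.frame(
  A = c("LOW", "HIGH", "NORMAL", "LOW"),
  B = c("TRUE", "FALSE", "TRUE", "TRUE"),
  C = c("ZERO", "LOW", "ZERO", "HIGH"),
  stringsAsFactors = FALSE
)
# put C first and A second; B follows
prep <- prepare_for_g2(toy, c("C", "A"))
```

With the columns in this order, the tested pair would be columns 1 and 2 and the conditioning set would start at column 3, with prep$dc supplying the counts in the same order. That call is not run here, so treat it as an assumption to verify against the Rfast manual.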

This is what I did (and how I checked it):

# there was no PVS, I assume this was PVSAT
c1 <- "PVSAT"
# c1 <- "PVS"

# there was no ACO2, I assume this was ARTCO2
c2 <- "ARTCO2"
# c2 <- "ACO2"

# there are no columns with these names...
# for VALV - VENTALV; for VLNG - VENTLUNG; for VTUB - VENTTUBE; for VMCH - VENTMACH
s <- c("VENTALV", "VENTLUNG", "VENTTUBE", "VENTMACH")
# s <- c("VALV", "VLNG", "VTUB", "VMCH")

This next chunk is exactly as you wrote it:

n <- colnames(dataset) 

col_c1 <- match(c1, n)
col_c2 <- match(c2, n)

cols_c3 <- c()

uni <- c(length(unique(dataset[c1])[[1]])[[1]],length(unique(dataset[c2])[[1]])[[1]])

if (!s[1]=="()"){
  for(v in s){
    idx <- match(v, n)
    cols_c3 <- append(cols_c3,idx)
    uni <- append(uni,length(unique(dataset[v])[[1]])[[1]])
  }
}
# transforming the str DataFrame into an integer matrix
for (nn in n){
  dataset[nn] <- unclass(as.factor(dataset[nn][[1]]))
}

ds <- as.matrix(dataset)

This is where I made the minimum zero:

# look at the number of unique values before changing, as a means of validation
sapply(1:ncol(ds), function(x) length(unique(ds[, x])))
# look at the minimum, as a means of validation
sapply(1:ncol(ds), function(x) min(ds[,x]))
# the minimum value must be zero
ds <- ds - 1
# check
sapply(1:ncol(ds), function(x) min(ds[,x]))
sapply(1:ncol(ds), function(x) length(unique(ds[, x])))

# looked as expected

Next, I rearranged the columns. I did this before removing the names so I could use the names to ensure the order was correct.

# the data must be consecutive numbers
# catch names before and after
n2 <- dimnames(ds)
# some of the results from this:
# [[2]]
#  [1] "HISTORY"      "CVP"          "PCWP"         "HYPOVOLEMIA"

# create the list of column indices other than those getting called in g2Test
tellMe <- c(1:ncol(ds))
tellMe <- tellMe[-c(col_c1, col_c2, sort(cols_c3))] 

# rearrange using the indices
ds <- ds[, c(col_c1, col_c2, sort(cols_c3), tellMe)]

# check it
(n3 <- dimnames(ds))
# some of the results from this
# [[2]]
#  [1] "PVSAT"        "ARTCO2"       "VENTMACH"     "VENTTUBE"

All that's left is removing the names (just as you did) and then calling the function. Since the indices changed, your objects won't work here, though.

colnames(ds) <- NULL

# running the G2 test
# res <- g2Test(ds, col_c1, col_c2, sort(cols_c3), uni)
res2 <- g2Test(ds, 1, 2, c(3,4,5,6), c(3, 3, 4, 4, 4, 4))
# $statistic
# [1] 19.78506
# 
# $df
# [1] 1024
#
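As a sanity check on the output above, the reported degrees of freedom match the standard formula for a conditional G2 test, (levels of X - 1) * (levels of Y - 1) * product of the levels of the conditioning variables. A small sketch using the counts passed in the call (the variable names dc and df_expected are only for illustration):

```r
# number of distinct values for the six columns used in the call above
dc <- c(3, 3, 4, 4, 4, 4)

# (|X| - 1) * (|Y| - 1) * product of the conditioning variables' levels
df_expected <- (dc[1] - 1) * (dc[2] - 1) * prod(dc[3:6])
df_expected
# 2 * 2 * 4^4 = 1024
```

If the df returned by the function does not match this product, the counts vector is probably not aligned with the column order.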

Sorry about the column names; I downloaded the dataset from another source and expected it to have the same header. I was about to ask you where you found all that information, but then I read the docs again and got my question answered. Your answer, however, is great: clear, complete and kind. You totally deserve the points for the answer, thanks a lot (: — DaSim, Jan 18 '22 at 08:55
Hey @Kat, I have [this other question](https://stackoverflow.com/q/70770281/8156843) that is strictly related. Could you have a look at it please? I think you are the best person to try to answer it. Thanks — DaSim, Jan 19 '22 at 12:22
