3

I have a 100k row dataframe on which I want to compute a Cochran–Mantel–Haenszel test.

My variables are the educational level and a computed score factored in quantiles, and my grouping variable is the sex, and the code line looks like this :

mantelhaen.test(db$education, db$score.grouped, db$sex)

This code throws this error and warning :

Error in qr.default(a, tol = tol) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message: In ntot * rowsums : NAs produced by integer overflow

The error seems to be caused by my first variable, as on 7 variables tested I got the problem with only 2 of them, which seems to share no obvious common thing.

Missing values and factor levels don't seem to differ between variables which throws error and variable which doesn't. I tried with complete cases (with na.omit) and the problem persists.

What does trigger this error ? does it mean ?
How can I get rid of it ?

Interesting posts : R: NA/NaN/Inf in foreign function call (arg 1), What is integer overflow in R and how can it happen?

ADDENDUM : Here is the result of str (failures are education and imc.cl):

str(db[c("education","score.grouped","sex", ...)])
'data.frame':   104382 obs. of  7 variables:
 $ age.cl: Ord.factor w/ 5 levels "<30 ans"<"30-40 ans"<..: 5 2 1 1 3 4 2 3 4 4 ...
  ..- attr(*, "label")= chr "age"
 $ emploi2          : Factor w/ 8 levels "Agriculteurs exploitants",..: 3 5 6 8 8 8 8 3 3 3 ...
  ..- attr(*, "label")= chr "CSP"
 $ tabac            : Factor w/ 4 levels "ancien fumeur",..: 4 1 4 4 3 4 4 1 4 4 ...
  ..- attr(*, "label")= chr "tabac"
 $ situ_mari2       : Factor w/ 3 levels "Vit seul","Divorsé, séparé ou veuf",..: 3 2 1 1 1 3 1 3 2 3 ...
  ..- attr(*, "label")= chr "marriage"
 $ education        : Factor w/ 3 levels "Universitaire",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ revenu.cl        : Factor w/ 4 levels "<1800 euros/uc",..: 3 4 2 NA 4 1 1 4 4 1 ...
 $ imc.cl           : Ord.factor w/ 6 levels "Maigre"<"Normal"<..: 2 2 1 2 3 1 3 2 2 3 ...
  ..- attr(*, "label")= chr "IMC"

EDIT : by diving inside the function, the error and warning are caused by a call to qr.solve. I don't understand anything about this but I'll try to dive deeper
EDIT2 : inside qr.solve, the error is thrown by a Fortran call to .F_dqrdc2. This is so much beyond my level my nose is starting to bleed.
EDIT3 : I tried to head my data to find out which line is in cause :

db2 = db %>% head(99787)   #fails at 99788
db2 = db %>% tail(99698)   #fails at 99699
mantelhaen.test(db2$education, db2$score.grouped, db2$sex)

This gives me not much information but maybe it could give you.

Dan Chaltiel
  • 7,811
  • 5
  • 47
  • 92
  • 2
    The product of two integer numbers gets too large to be an integer, e.g., `10L * 2147483646L` reproduces the warning. That happens as a result of properties of your data. Without a reproducible example we can't really help more. – Roland Jan 24 '18 at 12:38
  • Yes, I saw some posts about this warning, this is what I understood. But how can I give a 100k rows reproducible example ? – Dan Chaltiel Jan 24 '18 at 12:42
  • @Roland could you please take a look at my edits ? Have you got an idea of how can I give you a reproducible example ? – Dan Chaltiel Jan 24 '18 at 13:25
  • can you show us `str(db[c("education","score.grouped","sex")])` ? Can you write code to randomly generate a data set with 100K rows with similar characteristics that causes the same error? – Ben Bolker Jan 24 '18 at 13:43
  • @BenBolker I added the `str` output. Sorry but I'm affraid I cannot generate them, I don't even know where to start for this. Please note that I tried 7 variables and only education and another throwed this error. – Dan Chaltiel Jan 24 '18 at 13:52
  • OK, can you edit to post `str(db)` (i.e. all of the variables) and tell us which other one failed? – Ben Bolker Jan 24 '18 at 13:56
  • This is what I tried to reproduce (it worked though): `set.seed(101); n <- 78000; db <- data.frame(education= factor(sample(1:3,replace=TRUE,size=n)), score= factor(sample(1:5,replace=TRUE,size=n)), sex= sample(c("M","F"),replace=TRUE,size=n)); mantelhaen.test(db$education, db$score, db$sex)` – Ben Bolker Jan 24 '18 at 13:57
  • @BenBolker Yes, no problem to make an example of a working test, but I have no clue about how to reproduce the error (since I don't know what cause it). I made the edits in `str` output. Thanks for your help ! – Dan Chaltiel Jan 24 '18 at 14:02
  • one more hint: I think the problem probably starts at your warning (the overflow warning) - the stuff inside `qr.solve()` is downstream of that. With this big a data set you could probably try `exact=FALSE` ... – Ben Bolker Jan 24 '18 at 14:10
  • @BenBolker `exact=FALSE` is the default value. I tried with `correct=FALSE` but the error persists. – Dan Chaltiel Jan 24 '18 at 14:27

1 Answers1

3

I was able to replicate the problem by making the data set bigger.

set.seed(101); n <- 500000
db <- data.frame(education=
                   factor(sample(1:3,replace=TRUE,size=n)),
                 score=
                   factor(sample(1:5,replace=TRUE,size=n)),
                 sex=
                   sample(c("M","F"),replace=TRUE,size=n))

After this, mantelhaen.test(db$education, db$score, db$sex) gives the reported error.

Thankfully, the real problem is not within the guts of the QR decomposition code: rather, it occurs when setting up a matrix prior to QR decomposition. There are two computations, ntot*colsums and ntot*rowsums, that overflow R's capacity for integer computation. There's a relatively easy way to work around this by creating a modified version of the function:

  • copy the source code: dump("mantelhaen.test",file="my_mh.R")
  • edit the source code
    • l. 1: modify name of function to my_mantelhaen.test (to avoid confusion)
    • lines 199 and 200: change ntot to as.numeric(ntot), converting the integer to double precision before the overflow happens
  • source("my_mh.R") to read in the new function

Now

my_mantelhaen.test(db$education, db$score, db$sex)  

should work. You should definitely test the new function against the old function for cases where it works to make sure you get the same answer.

Now posted to the R bug list, we'll see what happens ...

update 11 May 2018: this is fixed in the development version of R (3.6 to be).

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453