I am new to R and am fitting a logistic regression model. I am trying to run bigglm on a data set of 2M records with 100+ variables. The variables are all numeric: either continuous values or 0/1 integers that I have set up as indicators, e.g.

isOK,quantity,weight,isUS,isEU,isASIA
0,2,1.1,0,0,1
1,1,0.9,1,1,0

However, bigglm always throws an error:

Error in coef.bigqr(object$qr) : NA/NaN/Inf in foreign function call (arg 3)

traceback() shows the following:

14: coef.bigqr(object$qr)
13: coef(object$qr)
12: coef.biglm(iwlm)
11: coef(iwlm)
10: bigglm.function(formula = formula, data = datafun, ...)
9: bigglm(formula = formula, data = datafun, ...)
8: bigglm(formula = formula, data = datafun, ...)
7: bigglm.data.frame(myForm, data = myraw.data[i, , drop = FALSE], 
   family = binomial(link = logit))
6: bigglm(myForm, data = myraw.data[i, , drop = FALSE], family = binomial(link = logit))
5: bigglm(myForm, data = myraw.data[i, , drop = FALSE], family = binomial(link = logit)) at trial.r#48
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("trial.r")

I have done some research, and it was mentioned that bigglm needs every possible value/factor level to appear in each chunk. However, all my variables are numeric or indicators, so I think this is not necessary (please correct me if I'm mistaken). Anyway, I have already rearranged my data set so that in the first chunk (3000 rows in my case, as shown below), every integer variable has records with both 0 and 1.

# Fit the model on the first chunk of 3000 rows, then update it with each later chunk
for (i in chunk(myraw.data, by = 3000)) {
  if (i[1] == 1) {
    myFullLRModel <- bigglm(myForm, data = myraw.data[i, , drop = FALSE], family = binomial(link = logit))
  } else {
    myFullLRModel <- update(myFullLRModel, myraw.data[i, , drop = FALSE])
  }
}

Could you advise on why this error occurs? I cannot use glm because it always fails with insufficient memory.

Agaz Wani
oim

1 Answer

If you have 0/1 variables and, within one chunk, one of them takes only a single value (all zeros or all ones, i.e., the column is constant in that chunk), that would result in an NA coefficient, which could produce the problem that you're facing.
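
One way to check this is to scan each chunk for columns that are constant or contain non-finite values before fitting. The following is a minimal base R sketch, not part of biglm: the helper name find_problem_columns, the toy data frame, and the plain row-range chunking (rather than chunk()) are assumptions made for illustration.

```r
# Hypothetical helper (not from biglm): for each chunk of `chunk_size` rows,
# list the columns that are constant or contain NA/NaN/Inf within that chunk.
# Assumes every column of `df` is numeric.
find_problem_columns <- function(df, chunk_size) {
  starts <- seq(1, nrow(df), by = chunk_size)
  problems <- list()
  for (s in starts) {
    rows <- s:min(s + chunk_size - 1, nrow(df))
    part <- df[rows, , drop = FALSE]
    # A column with fewer than two distinct values is constant in this chunk
    is_constant <- vapply(part, function(x) length(unique(x)) < 2, logical(1))
    # is.finite() is FALSE for NA, NaN, Inf and -Inf
    has_nonfinite <- vapply(part, function(x) any(!is.finite(x)), logical(1))
    bad <- names(part)[is_constant | has_nonfinite]
    if (length(bad) > 0) {
      problems[[paste0("rows_", s, "_", max(rows))]] <- bad
    }
  }
  problems
}

# Toy data (made up): isUS is all ones in rows 4 to 6
toy <- data.frame(
  isOK     = c(0, 1, 0, 1, 1, 0),
  quantity = c(2, 1, 3, 1, 2, 4),
  isUS     = c(0, 1, 1, 1, 1, 1)
)
problems <- find_problem_columns(toy, chunk_size = 3)
print(problems)
```

On your data, something like find_problem_columns(myraw.data, 3000) would list, per chunk, the columns worth inspecting. An empty list means no chunk has a constant or non-finite column, in which case the cause is probably elsewhere.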

DaveArmstrong