8

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a 10,000-row sample of the data gives a rather complex model with 5 interaction terms: `Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5`. The glm() function could fit this model on the 10,000-row sample, but not on the whole dataset (320,000 rows).
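For reference, the fit that works on the sample looks roughly like this (a sketch; `full_dat` stands for the full dataset already pulled into R, and the sampling is just for illustration):

    ## glm() handles the selected model on a 10,000-row sample,
    ## but not on the full 320,000 rows
    samp <- full_dat[sample(nrow(full_dat), 10000), ]
    fit_sample <- glm(Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5,
                      data = samp, family = binomial(link = "logit"))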

Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():

fit <- bigglm(Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5,
              data = sqlQuery(myconn, train_dat),
              family = binomial(link = "logit"),
              chunksize = 1000, maxit = 10)

Error in coef.bigqr(object$qr) : 
NA/NaN/Inf in foreign function call (arg 3)

> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D, 
    bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar), 
    ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)

bigglm was able to fit a smaller model with fewer interaction terms, but it was not able to fit this same model even on the small dataset (10,000 rows).

Has anyone run into this problem before? Is there any other approach to fitting a complex logistic model to big data?

ybeybe

3 Answers

16

I've run into this problem many times, and it was always caused by the fact that the chunks processed by bigglm did not contain all the levels of a categorical (factor) variable.

bigglm crunches the data in chunks, and the default chunk size is 5000. If your categorical variable has, say, 5 levels, e.g. (a,b,c,d,e), and your first chunk (rows 1:5000) contains only (a,b,c,d) but no "e", you will get this error.

What you can do is increase the "chunksize" argument and/or cleverly reorder your data frame so that each chunk contains ALL the levels.
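A minimal sketch of both ideas, assuming a data frame `dat` with a factor column `X2` (the names are illustrative, not taken from the question):

    ## Which chunks of 5000 rows are missing a level of X2?
    chunksize <- 5000
    chunk_id  <- ceiling(seq_len(nrow(dat)) / chunksize)
    missing   <- tapply(dat$X2, chunk_id,
                        function(x) setdiff(levels(dat$X2), as.character(unique(x))))
    Filter(length, missing)    # non-empty entries are the problem chunks

    ## Round-robin reorder over the levels of X2 so every chunk sees all levels
    ord <- order(ave(seq_len(nrow(dat)), dat$X2, FUN = seq_along))
    dat <- dat[ord, ]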

hope this helps (at least somebody)

4

OK, so we were able to find the cause of this problem:

For one category combination in one of the interaction terms, there are no observations at all. The glm function was still able to run and simply reported NA for that coefficient, but bigglm doesn't like it. bigglm was able to fit the model once I dropped this interaction term.
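A quick way to spot such empty cells (a sketch; `dat`, `X1` and `X2` are stand-ins for the actual data and factors):

    ## Cross-tabulate the two factors in the interaction: any zero cell is a
    ## combination with no observations, which glm() reports as an NA
    ## coefficient but bigglm() chokes on.
    with(dat, table(X1, X2))
    xtabs(~ X1 + X2, data = dat) == 0    # TRUE marks the empty combinations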

I'll do more research on how to deal with this kind of situation.

ybeybe
  • you could probably just try `data=na.omit(sqlQuery(myconn,train_dat))` – Ben Bolker Jun 20 '14 at 17:51
  • Thanks Ben. I tried it, and it doesn't work, since na.omit eliminates rows with NAs in my data, but the problem is not NAs in my data. The problem is that bigglm cannot estimate a coefficient for one category combination. For example, X1 has three levels (1,2,3) and X2 has two levels (1,2), but there are no observations with X1=2 & X2=2, so the estimate for that combination would be NA. This is fine, but somehow bigglm wouldn't run at all. – ybeybe Jun 20 '14 at 18:16
  • this is called [rank deficiency](http://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it) -- I don't know how to handle it in `bigglm`, but at least you know what keyword to search for. – Ben Bolker Jun 20 '14 at 18:33
  • a little bit of searching doesn't bring up any obvious discussions. It might be hard -- maybe even worth contacting the `biglm` maintainers. – Ben Bolker Jun 20 '14 at 18:41
  • Thanks Ben! I'll post here again if I ended up contacting the maintainers. – ybeybe Jun 20 '14 at 19:26
0

I have met this error before, though it was with randomForest rather than biglm. The reason could be that the function cannot handle character variables, so you need to convert character variables to factors. Hope this can help you.
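For example, a minimal sketch of that conversion (assuming a data frame `dat`; the name is just illustrative):

    ## Convert every character column to a factor before fitting
    char_cols <- vapply(dat, is.character, logical(1))
    dat[char_cols] <- lapply(dat[char_cols], factor)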

Yoki
  • this should probably be a comment rather than an answer. – Ben Bolker Jun 20 '14 at 15:28
  • Thanks Yoki. All my variables are numeric or factor, no characters, but your answer did help us find the cause of this problem (see my other answer). Thanks again. – ybeybe Jun 20 '14 at 17:45