
I've seen several questions about this error, but none of the answers seem to apply to my situation.

I am building a logistic regression model with biglm.

I have a data.frame with ~250 variables and a little over a million rows.

Since bigglm() doesn't work with the dot notation for selecting all variables in the model, I am building my formula like this.

So if f is my formula and df is my dataframe, then my model looks like this:

fit <- bigglm(f, data = df, family=binomial(link="logit"), chunksize=100, maxit=10)

And I get the error: variable lengths differ (found for 'x')

When I check the length of x, it is exactly the same as the length of df.

Other StackOverflow questions seem to suggest it might be a problem with the way the formula is constructed. Or perhaps it is a problem with biglm?

jgozal
  • Perhaps you could edit your question to add the formula? Otherwise there is not much for us to work with. – Bryan Hanson Jan 05 '16 at 03:43
  • @BryanHanson, the formula contains 250 variable names. Perhaps I could add a shortened version of it? I added the method I used to create it in hope that it would help understand how it looked. What do you suggest? – jgozal Jan 05 '16 at 03:45
  • Hmm... Do all variables have the same length? No missing values? That might cause your error. I guess you can troubleshoot by doing a binary search with your independent variables (i.e. use just the first half of the variables; if there is no error, try the second half, then divide the offending half in two for testing, and repeat until you find the bad variable). – Bryan Hanson Jan 05 '16 at 03:51
  • Here's something interesting I noticed. `x` is always the first variable in the formula (x[1]). If I NULL `x`, then x[2] becomes x[1], so the error now says `variable lengths differ (found for 'x')` for what used to be x[2], if that makes sense. All variables have the same length. There are certainly NAs in the dataframe, but is that a problem? I wouldn't like to get rid of all rows containing NAs, since they might contain other valuable information. – jgozal Jan 05 '16 at 03:57
  • I just installed the package, and `?bigglm` is pretty cryptic about how `NA` are handled (you might try passing `na.omit = TRUE` and it might fall through and be accepted by the underlying functions). Also, you might try `na.omit(your data frame)` and use that, if only for troubleshooting. Otherwise, I'm not sure what to try. – Bryan Hanson Jan 05 '16 at 04:06
  • I was able to find the issue, please read answer below. Thank you for your help! – jgozal Jan 05 '16 at 04:16
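
The two troubleshooting ideas from the comments above (checking for missing values, then bisecting the predictors) can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy data frame, the helper `find_bad_predictor()`, and the stand-in predicate `demo_ok()` are hypothetical names that do not come from the thread, and the bisection assumes a single offending variable.

```r
# Toy data standing in for the real data frame (column names are illustrative).
df <- data.frame(
  y  = c(0, 1, 0, 1, 1),
  x1 = c(1.2, NA, 3.1, 2.2, 0.4),
  x2 = c(5, 3, 8, NA, 2),
  x3 = c(0.1, 0.2, 0.3, 0.4, 0.5)
)

# Count missing values per column and see how many rows survive na.omit().
na_counts <- colSums(is.na(df))
complete  <- na.omit(df)
print(na_counts[na_counts > 0])
print(nrow(complete))

# Hypothetical helper: bisect a set of predictor names to find one that
# makes the fit fail. `fits_ok` is a function(vars) that returns TRUE when
# a model using only `vars` fits without error.
find_bad_predictor <- function(vars, fits_ok) {
  if (fits_ok(vars)) return(NULL)            # nothing fails in this set
  while (length(vars) > 1) {
    half  <- seq_len(length(vars) %/% 2)
    first <- vars[half]
    # Keep whichever half still reproduces the failure.
    vars <- if (!fits_ok(first)) first else vars[-half]
  }
  vars
}

# Stand-in predicate for demonstration only: pretend "x7" breaks the fit.
demo_ok <- function(vars) !("x7" %in% vars)
bad <- find_bad_predictor(paste0("x", 1:10), demo_ok)
print(bad)
```

With real data, `fits_ok` could wrap the actual model call in `try()` and report whether it raised an error; that wrapper is left out here so the sketch runs without biglm installed.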

1 Answer


I was able to solve this issue by making a slight modification to the way I was constructing my formula for bigglm().

As shown in the link attached in my question, I was constructing the formula like this:

n <- names(df)
f <- as.formula(paste("y ~", paste(n[!n %in% "y"], collapse = " + ")))

What f was missing was the df$ before each variable name in the formula. Modifying the as.formula() call to prepend "df$" to each variable name fixed this issue.
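
A minimal sketch of what that modification might look like, assuming the prefix is added to every term including the response `y` (the answer does not show the final code, so this is a guess at its shape; the toy data frame is illustrative):

```r
# Toy data standing in for the real data frame.
df <- data.frame(
  y  = c(0, 1, 0, 1),
  x1 = c(1.2, 0.7, 3.1, 2.2),
  x2 = c(5, 3, 8, 1)
)

n <- names(df)
predictors <- n[!n %in% "y"]

# Original construction: bare column names.
f_plain <- as.formula(paste("y ~", paste(predictors, collapse = " + ")))

# Modified construction: every name is prefixed with "df$".
f_prefixed <- as.formula(
  paste("df$y ~", paste0("df$", predictors, collapse = " + "))
)

print(f_plain)
print(f_prefixed)

# The prefixed formula resolves its variables without a `data` argument.
mf <- model.frame(f_prefixed)
print(nrow(mf))
```

Because each term in the prefixed formula names the data frame explicitly, the variables are found regardless of what is passed as `data`.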

jgozal
  • Or, you can usually add `data = df` to your call, and the function will look in `df` for your variable names. Either way, you fixed it! – Bryan Hanson Jan 05 '16 at 04:19
  • oh I actually had `data = df` in there. My dataframe is named `data`. My mistake, I'll modify the question. But still strange how that didn't work at first – jgozal Jan 05 '16 at 04:21
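
For context on the `data = df` lookup mentioned in the comments above, here is a small base R sketch of one common way the "variable lengths differ" message arises: a formula variable is not a column of `data`, so R falls back to an object of the same name in the formula's environment, and that object has a different length. Whether this is what happened in the question is not confirmed by the thread; the names `df`, `x1`, and `z` are illustrative.

```r
df <- data.frame(
  y  = c(0, 1, 0, 1, 1, 0),
  x1 = c(1.5, 2.1, 0.3, 4.2, 3.3, 0.9)
)

# Every variable in the formula is a column of `data`: this works.
mf <- model.frame(y ~ x1, data = df)
print(dim(mf))

# `z` is not a column of df, so it is looked up in the formula's
# environment, where it has length 3 instead of 6.
z <- 1:3
err <- tryCatch(
  model.frame(y ~ z, data = df),
  error = function(e) conditionMessage(e)
)
print(err)
```

Checking `setdiff(all.vars(f), names(df))` before fitting is one way to spot formula variables that are not columns of the data frame.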