
I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.

I am planning to run the estimation in chunks of approximately 100 observations each, updating the fit chunk by chunk. I tested this strategy with a small sample and it worked fine.
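For reference, here is roughly what that chunked strategy looks like on a small stand-in data set (the data and variable names below are placeholders, not my real data):

# minimal sketch of the chunked strategy on a small simulated data set
library(biglm)
set.seed(1)
small_df <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
chunks <- split(small_df, (seq_len(nrow(small_df)) - 1) %/% 100)   # ~100-row chunks
fit <- biglm(y ~ x1 + x2, data = chunks[[1]])
for (ch in chunks[-1]) fit <- update(fit, ch)   # update.biglm folds in each new chunk
summary(fit)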

However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.

I've already tried:

  • starting R with --max-ppsize=50000

  • setting options(expressions = 50000)

but the error persists.

I am working on Windows and using RStudio.

# create the sample data frame (in my real case I simply select 100 rows
# from the original data, which contains ~1,000,000 rows)
DF <- data.frame(matrix(rnorm(100 * 50000), nrow = 100, ncol = 50000))
colnames(DF) <- c("y", paste0("x", 1:49999))

# get names of covariates
my_xvars <- colnames(DF)[2:ncol(DF)]

# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR:
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
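For what it's worth, the more idiomatic reformulate() hits the same kind of limit once the number of terms gets this large (the answer below demonstrates this):

# reformulate() builds the same formula without paste(), but it also fails
# once there are roughly 15,000-20,000 terms
my_f <- try(reformulate(my_xvars, response = "y"))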

EDIT 1: The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model by selecting fewer variables is not the solution I am looking for right now.

Renato
  • what are you looking to get out of a 50,000 variable model? it might be worth looking into dimension reduction techniques before you model. – Mike Apr 04 '19 at 20:12
  • You may want to use methods other than OLS, like penalized regression, which can cope with many (correlated) predictors better than OLS (a minimal sketch follows these comments): http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/ – Apr 04 '19 at 20:37
  • The ultimate goal of my exercise is indeed to estimate the average effect of each of the 50,000 covariates on the dependent variable. Therefore, simplifying the model with fewer variables is not an ideal solution. I am already splitting the estimation into chunks of rows. It would be awesome if there were a way to split the estimation across subsets of columns as well, and then recover an unbiased estimate of the effect of each variable, but I couldn't find anything like that yet. – Renato Apr 04 '19 at 21:07
  • This doesn't seem valid, since the coefficient of each variable also depends on what other variables are in the model. Thus the mean coefficient across 2 different models can be quite different from the mean coefficient of one model with all the variables. – Apr 04 '19 at 22:40
  • @igoR87 you are absolutely right, feature space partitioning seems to be much more complicated than sample space partitioning. I found the thesis below from a Duke student that suggests a set of solutions for this type of problem. However, before trying to implement (and to understand) something like that, I was hoping that maybe there is a solution more easily available already https://dukespace.lib.duke.edu/dspace/bitstream/handle/10161/12912/Wang_duke_0066D_13692.pdf?sequence=1&isAllowed=y – Renato Apr 04 '19 at 23:22
  • It seems unlikely you wouldn't have multicollinearity issues dealing with 50K variables. – hmhensen Apr 04 '19 at 23:44
  • As for your error, this might help: https://stackoverflow.com/questions/214741/what-is-a-stackoverflowerror. – hmhensen Apr 04 '19 at 23:52
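A minimal sketch of the penalized-regression route suggested in the comments above, using glmnet on a small simulated stand-in (this is an alternative to the chunked-OLS approach, not part of it; the sizes here are placeholders):

# ridge / lasso take the design matrix directly (no huge formula) and cope
# with many correlated predictors better than plain OLS
library(glmnet)
set.seed(1)
n <- 100; p <- 1000                       # small stand-in for the real 1e6 x 5e4 problem
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
cvfit <- cv.glmnet(x, y, alpha = 0)       # alpha = 0: ridge; alpha = 1: lasso
coef(cvfit, s = "lambda.min")[1:5, ]      # first few penalized coefficients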

1 Answer


The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (the details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and response variable y directly, rather than using a formula to build them. However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc,2) = nc*(nc-1)/2 (where nc is the number of columns). When I try with 50000 columns I get

Error: cannot allocate vector of size 9.3 Gb

(2.3 Gb is required when nc is 25000). The code below runs on my laptop when nc <- 10000.
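For reference, a quick back-of-the-envelope check of those allocation sizes (8 bytes per double; vec_gib is just an illustrative helper, not part of biglm):

# size in GiB of the length-choose(nc, 2) numeric vector that bigqr.init() allocates
vec_gib <- function(nc) nc * (nc - 1) / 2 * 8 / 2^30
vec_gib(50000)   # ~9.3
vec_gib(25000)   # ~2.3
vec_gib(10000)   # ~0.37, which is why nc <- 10000 fits comfortably on a laptop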

I have a few caveats about this approach:

  • you won't be able to handle a problem with 50000 columns unless you have at least 10 GB of memory, because of the issue described above.
  • biglm:::update.biglm will have to be modified in a parallel way (this shouldn't be too hard; a rough sketch is included after the my_biglm code below)
  • I have no idea if the p >> n issue (which applies at the level of fitting the initial chunk) will bite you. When I run my example below (with 10 rows, 10000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails, and if so, whether there's a way to work around the problem or whether it's fundamental (so that you would need nr > nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem; a sketch of one such experiment follows this list. I've already spent too long on this ...)
  • don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1, X)) if you want one.
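A minimal sketch of the kind of small experiment mentioned in the third caveat, using the stock biglm at a scale where the ordinary formula interface still works: start from a first chunk with more columns than rows, keep updating, then compare against an all-at-once fit (the specific sizes below are arbitrary):

# does a p >> n first chunk (50 rows, 200 predictors) contaminate later updates?
library(biglm)
set.seed(1)
nr <- 1000; nc <- 200; chunk <- 50
dat <- data.frame(matrix(rnorm(nr * (nc + 1)), nrow = nr))
names(dat) <- c("y", paste0("x", seq_len(nc)))
f <- reformulate(paste0("x", seq_len(nc)), response = "y")

fit_all <- lm(f, data = dat)                    # reference fit: all rows at once
fit_chunk <- biglm(f, data = dat[1:chunk, ])    # initial chunk has more columns than rows
for (i in seq(chunk + 1, nr, by = chunk)) {
    fit_chunk <- update(fit_chunk, dat[i:(i + chunk - 1), ])
}

sum(is.na(coef(fit_chunk)))                              # coefficients that never recover
max(abs(coef(fit_all) - coef(fit_chunk)), na.rm = TRUE)  # agreement with the full fit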

Example (first save the code at the bottom as my_biglm.R):

nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))

respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)

# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))

source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))

my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
                      y=NULL, X=NULL, off=0) {
    if (!is.null(weights)) {
        if (!inherits(weights, "formula")) 
            stop("`weights' must be a formula")
        w <- model.frame(weights, data)[[1]]
    } else w <- NULL
    if (is.null(X)) {
        tt <- terms(formula)
        mf <- model.frame(tt, data)
        if (is.null(off <- model.offset(mf))) 
            off <- 0
        mm <- model.matrix(tt, mf)
        y <- model.response(mf) - off
    } else {
        ## model matrix specified directly
        if (is.null(y)) stop("both y and X must be specified")
        mm <- X
        tt <- NULL
    }
    qr <- biglm:::bigqr.init(NCOL(mm))
    qr <- biglm:::update.bigqr(qr, mm, y, w)
    rval <- list(call = sys.call(), qr = qr, assign = attr(mm, 
        "assign"), terms = tt, n = NROW(mm), names = colnames(mm), 
        weights = weights)
    if (sandwich) {
        p <- ncol(mm)
        n <- nrow(mm)
        xyqr <- biglm:::bigqr.init(p * (p + 1))  # bigqr.init is not exported, so qualify it
        xx <- matrix(nrow = n, ncol = p * (p + 1))
        xx[, 1:p] <- mm * y
        for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
        xyqr <- biglm:::update.bigqr(xyqr, xx, rep(0, n), w * w)
        rval$sandwich <- list(xy = xyqr)
    }
    rval$df.resid <- rval$n - length(qr$D)
    class(rval) <- "biglm"
    rval
}
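The parallel modification to biglm:::update.biglm mentioned in the second caveat might look roughly like the following untested sketch, which just pushes a new chunk's model matrix and response through the incremental QR and updates the row counts (the sandwich bookkeeping is omitted):

my_update_biglm <- function(object, y, X, w = NULL) {
    ## untested sketch: counterpart to biglm:::update.biglm that takes the new
    ## chunk's model matrix X and response y directly, instead of a data frame
    object$qr <- biglm:::update.bigqr(object$qr, X, y, w)
    object$n <- object$n + NROW(X)
    object$df.resid <- object$df.resid + NROW(X)
    object
}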
Ben Bolker
  • Thank you so much! The formula limit was solved, but, as you expected, I ran into the RAM limit with 50k vars. About your points 1 & 3: I don't have 10 GB available. About p > n within chunks: I tested a reduced version of the problem with my data (1k rows, 200 vars, 50 rows/chunk), and the chunked estimation returned the same results as estimating everything at once. However, my real data is sparse (>80% zeros). I then tested it with a dense matrix, and the results are very different, with the chunked estimation returning lots of NAs. I will look for alternative solutions to my problem. – Renato Apr 05 '19 at 12:20