
I am trying to estimate an lm() fit in R for a large sales dataset. The data itself is not so large that R cannot handle it; it is about 250MB in memory. The problem is that when lm() is invoked with all variables and cross-terms included, the construction of the model.matrix() throws an error saying the machine has run out of memory and cannot allocate a vector of the required size (in this case, about 47GB). Understandable; I don't have that much RAM.

I have tried the ff, bigmemory, and filehash packages, all of which work fine for out-of-memory work with the existing files (I particularly like the database functions of filehash). But I cannot, for the life of me, get the model.matrix to be created at all. I think the issue is that, despite mapping the output to the database I created, R tries to build it in RAM anyway, and can't. Is there a way to avoid this using these packages, or am I doing something wrong? (Also, using biglm and other functions to fit things chunk-wise doesn't help even when chunking one row at a time; again, it seems R tries to make the WHOLE model.matrix first, before chunking it.)

Any help would be greatly appreciated!

library(filehash)
library(ff)
library(ffbase)
library(bigmemory)
library(biganalytics)
library(dummies)
library(biglm)
library(dplyr)
library(lubridate)
library(data.table)



SID <- readRDS('C:\\JDA\\SID.rds')
SID <- as.data.frame(unclass(SID)) # to get characters as Factors

dbCreate('reg.db')
db <- dbInit('reg.db')
dbInsert(db, 'SID', SID)
rm(SID)
gc()

db$summary1 <-
  db$SID %>%
  group_by(District, Liable, TPN, mktYear, Month) %>%
  summarize(NV.sum = sum(NV))

start.time <- Sys.time()
# Here is where it throws the error:
db$fit <- lm(NV.sum ~ .^2, data = db$summary1)
Sys.time() - start.time
rm(start.time)
gc()

summary(db$fit)
anova(db$fit)
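
For reference, the biglm attempt mentioned above looked roughly like this (the formula is written out explicitly, and the chunk count and split are just illustrative):

# split the aggregated data into chunks and feed them to biglm one at a time
chunks <- split(db$summary1, rep(1:100, length.out = nrow(db$summary1)))
fit <- biglm(NV.sum ~ (District + Liable + TPN + mktYear + Month)^2,
             data = chunks[[1]])
for (i in 2:length(chunks)) {
  fit <- update(fit, chunks[[i]])  # add each chunk to the running fit
}

Even with very small chunks, the first call fails the same way; it still seems to be building the whole model matrix up front rather than chunk by chunk.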
Ryan Price
  • Try using a sparse matrix from the `Matrix` package. – Neal Fultz Dec 09 '14 at 18:00
  • How many variables do you have? How many observations? – MrFlick Dec 09 '14 at 18:00
  • @NealFultz, that did the trick, thank you. The matrix now fits in memory. However (and this may be rudimentary), I had planned on constructing a `model.matrix` first and then plugging that into an `lm` call. That was throwing errors earlier; is there a way to do that, or to specify in the `lm` call that the `model.matrix` it creates should be sparse? @MrFlick, there are about 750,000 observations in the original dataset, about 113,000 in the aggregated one (the one I'm using), and 6 variables, 5 of which are factors, and 2 of which have over 100 levels. – Ryan Price Dec 09 '14 at 18:10
  • Generally buying more RAM would help. If you have problems at this stage, they will probably appear also later on with more complicated operations. – Tim Dec 09 '14 at 18:47
  • It's the interaction term between the two factors with 100+ levels each that is blowing things up. – IRTFM Dec 09 '14 at 20:03

1 Answer


This is based on the example from solve-methods in the Matrix package:

> library(Matrix)
> ?`solve-methods`
> n1 <- 7; n2 <- 3
> dd <- data.frame(a = gl(n1,n2), b = gl(n2,1,n1*n2))# balanced 2-way
> X <- sparse.model.matrix(~ -1+ a + b, dd)# no intercept --> even sparser
> Y <- rnorm(nrow(X))
> # Forming normal equations manually and solving for beta-hat 
> solve(crossprod(X), crossprod(X, Y))
9 x 1 Matrix of class "dgeMatrix"
            [,1]
 [1,]  1.2384385
 [2,]  1.3313779
 [3,]  0.7497135
 [4,]  0.7840841
 [5,]  0.9586135
 [6,]  0.4667769
 [7,]  1.6648260
 [8,] -1.6669776
 [9,] -1.1142240
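
Applied to your aggregated data, it would look roughly like this (variable names taken from your code; untested on that data):

> library(Matrix)
> X <- sparse.model.matrix(NV.sum ~ .^2, data = db$summary1)  # stays sparse
> y <- db$summary1$NV.sum
> beta <- solve(crossprod(X), crossprod(X, y))  # normal equations, as above

Note that lm() itself won't accept a sparse model matrix, which is why the normal equations are solved directly here. With many interacting factor levels, crossprod(X) may be singular because some interaction cells are empty; in that case solve() will fail and you may need to drop some interactions.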
Neal Fultz