I am trying to fit a linear model with lm() in R on a large sales dataset. The data itself is not too large for R to handle; it is about 250 MB in memory. The problem is that when lm() is invoked with all variables and their cross-terms, building the model.matrix() throws an error saying it cannot allocate a vector of the required size (about 47 GB in this case). Understandably so, since I don't have that much RAM. I have tried the ff, bigmemory, and filehash packages, all of which work fine for out-of-memory work with the existing files (I particularly like the database functions of filehash). But I cannot, for the life of me, get the model.matrix to be created at all. I think the issue is that, despite mapping the output to the database I created, R tries to build it in RAM anyway and can't. Is there a way to avoid this using these packages, or am I doing something wrong?
[Also, using biglm and other functions to do things chunk-wise doesn't help, even with a chunk size of one. Again, it seems R is trying to make the WHOLE model.matrix first, before chunking it. Roughly what I have been attempting is sketched below.]
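This is a simplified sketch of the chunk-wise biglm pattern I have been trying (the chunk size of 10000 is arbitrary, and summary1 here stands in for a plain data.frame copy of the db$summary1 table built in the code further down):

# Simplified sketch of the chunk-wise biglm attempt; summary1 is assumed to
# be a plain data.frame copy of db$summary1 and 10000 is an arbitrary chunk size.
form <- NV.sum ~ (District + Liable + TPN + mktYear + Month)^2
idx  <- split(seq_len(nrow(summary1)),
              ceiling(seq_len(nrow(summary1)) / 10000))
fit  <- biglm(form, data = summary1[idx[[1]], ])
for (i in idx[-1]) {
  fit <- update(fit, summary1[i, ])  # fold in the next chunk
}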
Any help would be greatly appreciated!
# packages for file-backed storage, chunked fitting, and data wrangling
library(filehash)
library(ff)
library(ffbase)
library(bigmemory)
library(biganalytics)
library(dummies)
library(biglm)
library(dplyr)
library(lubridate)
library(data.table)
# load the sales data (~250 MB in memory)
SID <- readRDS('C:\\JDA\\SID.rds')
SID <- as.data.frame(unclass(SID)) # to get characters as Factors
# store the data in a filehash database and drop the in-memory copy
dbCreate('reg.db')
db <- dbInit('reg.db')
dbInsert(db, 'SID', SID)
rm(SID)
gc()
# aggregate NV by the grouping variables
db$summary1 <-
  db$SID %>%
  group_by(District, Liable, TPN, mktYear, Month) %>%
  summarize(NV.sum = sum(NV))
start.time <- Sys.time()
# Here is where it throws the error:
db$fit <- lm(NV.sum ~ .^2, data = db$summary1)
Sys.time() - start.time
rm(start.time)
gc()
summary(db$fit)
anova(db$fit)
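For context on the 47 GB figure: the width of the dense design matrix is driven by the pairwise products of the factor level counts. A rough back-of-the-envelope check (not part of my original script; it ignores rank deficiencies and treats every non-response column as a factor) looks like this:

# Rough size estimate for the ~ .^2 design matrix of db$summary1
summary1 <- db$summary1
p <- sapply(summary1[c('District', 'Liable', 'TPN', 'mktYear', 'Month')],
            function(x) nlevels(as.factor(x)))
main.cols  <- sum(p - 1)                  # main-effect dummy columns
pair.cols  <- sum(combn(p - 1, 2, prod))  # pairwise interaction columns
total.cols <- 1 + main.cols + pair.cols
total.cols * nrow(summary1) * 8 / 1024^3  # approximate GB for dense doubles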