I have a large data set I'm working with in R using some of the big.___()
packages. It's ~10 GB (100 million rows x 15 columns) and looks like this:
Price  Var1  Var2
12.45     1     1
33.67     1     2
25.99     3     3
14.89     2     2
23.99     1     1
  ...   ...   ...
I am trying to predict Price based on Var1 and Var2.
The problem I've run into is that Var1 and Var2 are categorical/factor variables.
Var1 and Var2 each have three levels (1, 2, and 3), but only six combinations actually appear in the data set
(1,1; 1,2; 1,3; 2,2; 2,3; 3,3).
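As a small illustration of the factor issue (using made-up data, not the real set): declaring all three levels explicitly keeps the level set identical everywhere, even in a slice of the data where some level never occurs.

```r
# Toy data frame standing in for the real one; level 2 of Var1 is absent
dat <- data.frame(Var1 = c(1, 1, 3), Var2 = c(1, 2, 3))

# Fix the full set of levels up front instead of letting factor() infer them
dat$Var1 <- factor(dat$Var1, levels = 1:3)
dat$Var2 <- factor(dat$Var2, levels = 1:3)

levels(dat$Var1)  # "1" "2" "3" even though no row has Var1 == 2
```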
To use factor variables in biglm(), every factor level must be present in each chunk of data that biglm processes (my understanding is that biglm breaks the data set into some number of chunks and updates the regression parameters after analyzing each chunk, which is how it handles data sets larger than RAM).
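The chunk-and-update idea can be sketched in base R with the normal equations, which is essentially what an incremental least-squares fit accumulates (all names and the toy data below are illustrative, not from the post). Fixing the factor levels up front is what keeps the design matrix columns identical across chunks:

```r
# Toy data standing in for the 100M-row set
set.seed(1)
n   <- 1000
dat <- data.frame(
  Price = rnorm(n, 20, 5),
  Var1  = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
  Var2  = factor(sample(1:3, n, replace = TRUE), levels = 1:3)
)

chunk_size <- 250
xtx <- NULL
xty <- NULL
for (start in seq(1, n, by = chunk_size)) {
  chunk <- dat[start:min(start + chunk_size - 1, n), ]
  # Because levels = 1:3 was fixed above, model.matrix() builds the
  # same dummy columns for every chunk, even if a level is absent
  X <- model.matrix(~ Var1 + Var2, data = chunk)
  y <- chunk$Price
  if (is.null(xtx)) {
    xtx <- crossprod(X)        # X'X for the first chunk
    xty <- crossprod(X, y)     # X'y for the first chunk
  } else {
    xtx <- xtx + crossprod(X)  # accumulate chunk by chunk
    xty <- xty + crossprod(X, y)
  }
}
beta_chunked <- solve(xtx, xty)  # solve the normal equations once

# Same fit in one pass, for comparison
beta_full <- coef(lm(Price ~ Var1 + Var2, data = dat))
```

Only X'X (a small p-by-p matrix) and X'y ever stay in memory, which is why this style of fitting scales past RAM.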
I've tried to subset the data, but either my computer can't handle it or my code is wrong:
bm11 <- big.matrix(150000000, 3)
bm11 <- subset(x, x[,2] == 1 & x[,3] == 1)
The above gives me a bunch of these:
Error: cannot allocate vector of size 1.1 Gb
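One memory-friendlier sketch of the same subset, assuming the data live in a bigmemory big.matrix: mwhich() scans the matrix and returns only the matching row indices, so memory use is proportional to the number of hits rather than to the full row count (the tiny matrix below is a stand-in for the real 150,000,000 x 3 one).

```r
library(bigmemory)

# Small stand-in for the real big.matrix; column 1 = Price,
# column 2 = Var1, column 3 = Var2 (layout assumed from the post)
bm <- big.matrix(6, 3, type = "double", init = 0)
bm[, 1] <- c(12.45, 33.67, 25.99, 14.89, 23.99, 18.50)
bm[, 2] <- c(1, 1, 3, 2, 1, 1)
bm[, 3] <- c(1, 2, 3, 2, 1, 1)

# Row indices where Var1 == 1 AND Var2 == 1, without building a
# full-length logical vector in R
rows <- mwhich(bm, cols = c(2, 3), vals = list(1, 1),
               comps = list('eq', 'eq'), op = 'AND')

# Pull only the matching rows into an ordinary R matrix
subset11 <- bm[rows, , drop = FALSE]
```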
Does anyone have any suggestions for working around this issue?
I'm using 64-bit R on a Windows 7 machine with 4 GB of RAM.