I have a function, as follows, that takes a design matrix X with class type big.matrix as input and predicts the responses.
NOTE: the size of matrix X is over 10 GB, so I cannot load it into memory. I used read.big.matrix() to generate the backing files X.bin and X.desc.
myfun <- function(X) {
## do something with X. class(X) == 'big.matrix'
}
My question is: how can I do cross-validation efficiently with this huge big.matrix?
My attempt: (It works, but is time consuming.)
- Step 1: for each fold, get the indices for training (idx.train) and testing (idx.test);
- Step 2: divide X into X.train and X.test. Since X.train and X.test are also very large, I have to store them as big.matrix objects and create the associated backing files (.bin, .desc) for the training and test sets for each fold.
- Step 3: feed X.train to build the model, and predict the responses for X.test.
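Step 1 can be sketched in base R as below. This is only a minimal illustration, not the exact code from my attempt: `make_folds`, `fold.id`, `n`, and `k` are hypothetical names, and a small `n` stands in for nrow(X).

```r
# Hypothetical helper: randomly assign each of n rows to one of k folds.
make_folds <- function(n, k, seed = 1) {
  set.seed(seed)
  # rep_len() cycles through 1..k, so fold sizes differ by at most one row
  sample(rep_len(seq_len(k), n))
}

n <- 100   # stand-in for nrow(X)
k <- 10
fold.id <- make_folds(n, k)

for (fold in seq_len(k)) {
  idx.test  <- which(fold.id == fold)   # rows held out in this fold
  idx.train <- which(fold.id != fold)   # remaining rows used for fitting
  # Steps 2 and 3 would go here
}
```

The index vectors themselves are cheap to compute and store; only materializing the corresponding rows of X is expensive.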
The time-consuming part is Step 2, where I have to create backing files for the training and test sets many times, which is almost like copying and pasting the original big matrix. For example, suppose I do 10-fold cross-validation: Step 2 would take over 30 minutes to create the backing files for all 10 folds!
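To make the cost of Step 2 concrete, here is a rough sketch of one fold on a tiny stand-in matrix. It assumes the split is done with bigmemory's deepcopy(), which is one way to do it and may differ from the exact code I ran; the file names and the 20 x 3 size are made up for illustration.

```r
library(bigmemory)

# Small file-backed stand-in for the real 10 GB matrix X
tmp <- tempdir()
X <- big.matrix(nrow = 20, ncol = 3, type = "double", init = 0,
                backingfile = "X.bin", descriptorfile = "X.desc",
                backingpath = tmp)
X[, ] <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3)

# Illustrative indices for a single fold
idx.test  <- 1:2
idx.train <- 3:20

# deepcopy() writes a new backing file holding only the selected rows,
# so each fold rewrites roughly the whole matrix to disk.
X.train <- deepcopy(X, rows = idx.train, backingpath = tmp,
                    backingfile = "X.train.bin",
                    descriptorfile = "X.train.desc")
X.test  <- deepcopy(X, rows = idx.test, backingpath = tmp,
                    backingfile = "X.test.bin",
                    descriptorfile = "X.test.desc")
```

With 10 folds, the loop around this block produces 10 pairs of backing files, which is where the 30 minutes go.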
To solve this issue in Step 2, I think maybe I can divide the original matrix into 10 sub-matrices (of class type big.matrix) just once. Then, for each fold, I use one portion for testing and combine the remaining 9 portions into one big matrix for training. But the new issue is that there is no way to combine small big.matrix objects into a larger one efficiently without copy/paste.
Of course I can do distributed computing for this cross validation procedure. But I am just wondering whether there is a better way to speed up the procedure if just using a single core.
Any ideas? Thanks in advance.
UPDATE:
It turns out that @cdeterman's answer doesn't work when X is very large. The reason is that the mpermute() function permutes the rows by essentially doing copy/paste. mpermute() calls ReorderRNumericMatrix() in C++, which then calls the reorder_matrix() function. This function reorders the matrix by looping over all columns and rows and doing copy/paste. See the source code here.
Are there any better ideas for solving my problem?? Thanks.
END UPDATE