
I have a function, as follows, that takes a design matrix X of class big.matrix as input and predicts the responses.

NOTE: matrix X is over 10 GB, so I cannot load it into memory. I used read.big.matrix() to generate the backing files X.bin and X.desc.
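
For reference, that one-time setup looks roughly like this (a sketch; the source file name X.csv is hypothetical, while X.bin and X.desc are the files mentioned above):

library(bigmemory)

# One-time parse of the raw data; writes X.bin / X.desc as a side effect
X <- read.big.matrix("X.csv", header = TRUE, type = "double",
                     backingfile = "X.bin", descriptorfile = "X.desc")

# In later sessions, re-attach instantly without re-reading the 10+ GB file
X <- attach.big.matrix("X.desc")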

myfun <- function(X) {
## do something with X. class(X) == 'big.matrix'
}

My question is: how can I do cross-validation efficiently with this huge big.matrix?

My attempt (it works, but is time-consuming):

  • Step 1: for each fold, get the training indices idx.train and test indices idx.test;
  • Step 2: divide X into X.train and X.test. Since X.train and X.test are also very large, I have to store them as big.matrix objects, creating associated backing files (.bin, .desc) for the training and test sets of each fold (sketched after this list);
  • Step 3: feed X.train to build the model, and predict responses for X.test.
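
For concreteness, Step 2 looks roughly like the following (a sketch; bigmemory::deepcopy() with row indices stands in for however the per-fold copies are actually made, and the file names are illustrative):

library(bigmemory)

X <- attach.big.matrix("X.desc")
n <- nrow(X)
folds <- sample(rep(1:10, length.out = n))  # random fold label per row

for (k in 1:10) {
  idx.train <- which(folds != k)
  idx.test  <- which(folds == k)
  # deepcopy() physically writes the selected rows into new backing files;
  # this is the slow part for a 10+ GB matrix
  X.train <- deepcopy(X, rows = idx.train,
                      backingfile    = sprintf("X_train_%d.bin", k),
                      descriptorfile = sprintf("X_train_%d.desc", k))
  X.test  <- deepcopy(X, rows = idx.test,
                      backingfile    = sprintf("X_test_%d.bin", k),
                      descriptorfile = sprintf("X_test_%d.desc", k))
  ## fit on X.train, predict for X.test (Step 3)
}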

The time-consuming part is Step 2, where I have to create backing files for the training and test sets (almost like copying the original big matrix) many times. For example, with 10-fold cross-validation, Step 2 takes over 30 minutes just to create the backing files for all 10 folds!

To solve the issue in Step 2, I think I can divide the original matrix into 10 sub-matrices (of class big.matrix) just once. Then for each fold, I use one portion for testing and combine the remaining 9 portions into one big matrix for training. But the new issue is that there is no way to combine small big.matrix objects into a larger one efficiently without copying.

Of course I could use distributed computing for this cross-validation procedure. But I am wondering whether there is a better way to speed up the procedure using just a single core.

Any ideas? Thanks in advance.

UPDATE:

It turns out that @cdeterman's answer doesn't work when X is very large, because mpermute() permutes the rows by essentially copying them. mpermute() calls ReorderRNumericMatrix() in C++, which in turn calls the reorder_matrix() function. That function reorders the matrix by looping over all columns and rows, copying element by element. See the source code here.

Are there any better ideas for solving my problem? Thanks.

END UPDATE

SixSigma

1 Answer


You will want to use the sub.big.matrix function. This avoids any further copies and points to the same original data. However, it can currently only subset contiguous rows, so you will want to permute your rows first.

library(bigmemory)
bm <- attach.big.matrix("X.desc")  # the file-backed matrix (X in the question)

# Step 1 - randomly permute the rows in place
idx <- sample(nrow(bm), nrow(bm))
mpermute(bm, order = idx)

# Step 2 - create your folds
fold_size <- nrow(bm) / 10  # assuming 10 folds
idx_list <- split(seq_len(nrow(bm)), ceiling(seq_len(nrow(bm)) / fold_size))

# Step 3 - list of sub.big.matrix objects (views into bm, not copies)
sm_list <- lapply(idx_list, function(x)
  sub.big.matrix(bm, firstRow = x[1], lastRow = x[length(x)]))

You now have the original big.matrix split into 10 different matrices that you can use as you like.
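
For example (assuming the sm_list built above), each element behaves like an ordinary big.matrix but is only a view into the shared backing file:

sm <- sm_list[[1]]
dim(sm)        # about nrow(bm)/10 rows, all columns
sm[1:5, 1:5]   # regular big.matrix subscripting; nothing was copied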

cdeterman
  • Thanks @cdeterman. This is really good to know. However, this doesn't solve the issue completely: I still have to figure out how to combine the 9 sub.big.matrix objects into one for training. – SixSigma Nov 09 '15 at 20:38
  • Alternatively, based on your suggestion, it seems that I can `mpermute` the matrix each time, putting the test portion (where idx == fold id) at the bottom, and then use `sub.big.matrix` to get the training and test parts. But since `mpermute` CHANGES the original matrix, this doesn't seem to work either. – SixSigma Nov 09 '15 at 20:47
  • The second approach seems to work if I keep updating the index vector used in `mpermute` (sketched after these comments). Thank you very much, @cdeterman. – SixSigma Nov 09 '15 at 20:54
  • Hi @cdeterman, it turns out that your approach doesn't work, because `mpermute()` takes a long time. I am wondering whether you have other ideas for this problem. Thanks a lot. – SixSigma Nov 10 '15 at 18:06
  • @AaronZeng at this point this is still probably the best solution. It may take a while, but at least it will fit in RAM for you and you can ultimately split the matrix up. Non-contiguous subsetting just isn't currently supported. Feel free to submit an issue [here](https://github.com/kaneplusplus/bigmemory/issues) – cdeterman Nov 10 '15 at 18:11
  • Shouldn't it be `sample(nrow(bm), nrow(bm))` and `mpermute(bm, idx)` in the first lines? – Espen Riskedal Nov 22 '18 at 11:57
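
Pulling the comment thread together, here is a sketch of that second approach: before each fold, re-permute so the test rows sit at the bottom, keep track of the running order, and take two contiguous sub.big.matrix views. (As the update above notes, mpermute() still copies data in place, so this avoids creating new backing files but not the copying itself.)

library(bigmemory)

X <- attach.big.matrix("X.desc")
n <- nrow(X)
folds <- sample(rep(1:10, length.out = n))  # fold label per original row
cur <- seq_len(n)                           # original row id at each current position

for (k in 1:10) {
  # stable reorder: training rows first, fold-k (test) rows at the bottom
  ord <- order(folds[cur] == k)
  mpermute(X, order = ord)  # rewrites X in place
  cur <- cur[ord]           # track where each original row now lives

  n.test  <- sum(folds == k)
  X.train <- sub.big.matrix(X, firstRow = 1, lastRow = n - n.test)
  X.test  <- sub.big.matrix(X, firstRow = n - n.test + 1, lastRow = n)
  ## fit on X.train, predict on X.test
}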