
We are trying to use the bigmemory package with foreach to parallelize our analysis. However, the as.big.matrix function always seems to use a backing file. Our workstations have enough memory; is there a way to use bigmemory without the backing file?

This code, x.big.desc <- describe(as.big.matrix(x)), is pretty slow because it writes the data to C:\ProgramData\boost_interprocess\. Somehow it is even slower than saving x directly; does as.big.matrix have slower I/O?

This code, x.big.desc <- describe(as.big.matrix(x, backingfile = "")), is pretty fast; however, it also saves a copy of the data to the %TMP% directory. We think it is fast because R kicks off a background writing process instead of actually writing the data. (We can see the writing thread in Task Manager after the R prompt returns.)

Is there a way to use bigmemory with RAM only, so that each worker in the foreach loop can access the data via RAM?
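
For reference, here is a minimal sketch of the pattern we are trying to get working (the matrix x, the 3-core cluster, and the column-sum task are only placeholders for our real setup): the descriptor of a shared big.matrix is passed to each foreach worker, which re-attaches it with attach.big.matrix(). On our Windows machines, the as.big.matrix() call is where the copy under C:\ProgramData\boost_interprocess\ gets written.

library(bigmemory)
library(foreach)

x <- matrix(rnorm(1e6), 1e3, 1e3)   # stand-in for our real data
x.big <- as.big.matrix(x)           # shared big.matrix; the slow write happens here
x.big.desc <- describe(x.big)

cl <- parallel::makeCluster(3)
doParallel::registerDoParallel(cl)
res <- foreach(j = 1:ncol(x), .combine = 'c', .packages = "bigmemory") %dopar% {
  xb <- attach.big.matrix(x.big.desc)  # each worker attaches the shared matrix
  sum(xb[, j])                         # and reads only the column it needs
}
parallel::stopCluster(cl)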

Thanks for the help.

user7648269
  • From the vignette, it appears that file-backing is not necessary: *The data structures may be allocated to shared memory, allowing separate processes on the same computer to share access to a single copy of the data set. The data structures may also be file-backed, allowing users to easily manage and analyze data sets larger than available RAM and share them across nodes of a cluster.* – lmo Aug 18 '17 at 15:36
  • You can use a `big.matrix` in RAM by specifying `shared = FALSE` (see the sketch after these comments). Yet, it won't be shared between processes, so you should use a standard matrix instead, which will be copied to each cluster node. What is the problem with data stored on disk? – F. Privé Aug 18 '17 at 17:16
  • lmo: Though the documentation says the backing file is not necessary, in practice it is always used. Privé: with `shared = FALSE`, we cannot use foreach parallel processing; the workers won't be able to access the data. Using the disk is slow; our data is big but we have large RAM, so we want to avoid loading data from disk if we can use RAM only. – user7648269 Aug 18 '17 at 17:40
  • If you have enough memory, why not use standard R matrices then? – F. Privé Aug 18 '17 at 21:03
  • @F.Privé we want to use foreach to parallelize the analysis, but passing the big matrix to each worker is very slow, so we want to use bigmemory so that each worker only copies the columns it needs. Also, we have enough memory for the data, but not enough RAM if we have 20 workers and each worker obtains a copy of the data. Thanks for replying. – user7648269 Aug 18 '17 at 22:51
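
A minimal sketch of the `shared = FALSE` suggestion from the comments (the matrix x is again a placeholder): this keeps the big.matrix entirely in the calling process's RAM, so nothing is written to disk, but, as noted above, it cannot be attached from other processes and therefore does not help the foreach workers.

library(bigmemory)

x <- matrix(rnorm(1e6), 1e3, 1e3)
x.ram <- as.big.matrix(x, shared = FALSE)  # in-RAM, process-local: no backing
                                           # file, but invisible to other R processes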

1 Answer


So, if you have enough RAM, just use standard R matrices. To pass only a part of the matrix to each cluster worker, use .rds files.

One example computing the colSums with 3 cores:

# Functions for splitting 1:m into nb contiguous intervals
CutBySize <- function(m, nb) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}
# Sequence over one interval (one row of the matrix returned by CutBySize)
seq2 <- function(lims) seq(lims[1], lims[2])

# The matrix
bm <- matrix(1, 10e3, 1e3)
ncores <- 3
intervals <- CutBySize(ncol(bm), ncores)
# Save each part in a different file
tmpfile <- tempfile()
for (ic in seq_len(ncores)) {
  saveRDS(bm[, seq2(intervals[ic, ])], 
          paste0(tmpfile, ic, ".rds"))
}
# Parallel computation with reading one part at the beginning
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
library(foreach)
colsums <- foreach(ic = seq_len(ncores), .combine = 'c') %dopar% {
  bm.part <- readRDS(paste0(tmpfile, ic, ".rds"))
  colSums(bm.part)
}
parallel::stopCluster(cl)
# Checking results
all.equal(colsums, colSums(bm))

You could even use `rm(bm); gc()` after writing the parts to disk.

F. Privé
  • Privé, this is exactly our current method. It would be nicer if we could use bigmemory to store the data and have each worker copy the block of data it needs, instead of writing/reading RDS files. It would be useful if bigmemory could be used as shared memory, so that foreach workers can copy data from there. Thanks. – user7648269 Aug 21 '17 at 15:33