
I am working on a huge dataset and would like to derive the distribution of a test statistic. This requires calculations with huge matrices (200000x200000) and, as you might predict, I run into memory issues. More precisely, I get the following: Error: cannot allocate vector of size ... Gb. I am using the 64-bit version of R and have 8 GB of RAM. I tried the bigmemory package, but without much success.

The first issue arises when I calculate the distance matrix. I found a nice function called Dist in the amap package that calculates the distances between the columns of a data frame in parallel, and it works well; however, it returns only the lower/upper triangle. I need the full distance matrix to perform matrix multiplications, which I cannot do with half of the matrix. When I use as.matrix to make it full, I run into memory issues again.

So my question is: how can I convert a dist object to a big.matrix while skipping the as.matrix step? I suppose this might be an Rcpp question; please keep in mind that I am really new to Rcpp.

Thanks in advance!

Akis
  • You might look at [MRO](https://mran.microsoft.com/) or [switching the BLAS on OS X CRAN R](https://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Which-BLAS-is-used-and-how-can-it-be-changed_003f). I'm not sure if it will help with your memory issue, but it will certainly speed up matrix operations. – alistaire Feb 18 '16 at 16:37
  • There is a similar `big.matrix` distance question [here](http://stackoverflow.com/questions/26958646/calculate-euclidean-distance-matrix-using-a-big-matrix-object) that may be of help. – cdeterman Feb 22 '16 at 14:04

1 Answer


On converting a "dist" object to a "(big.)matrix": stats:::as.matrix.dist calls row, col, t and operators that create large intermediate objects. To avoid these you could, among other alternatives, fill the matrix one column at a time, with something like:

With data:

nr = 1e4
m = matrix(runif(nr), nr, 10)
d = dist(m)
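
To put the sizes in perspective, here is a rough back-of-the-envelope sketch. It assumes 8 bytes per double and ignores object overhead; the helper names (dist_bytes, full_bytes, gib) are made up for illustration.

```r
# Bytes needed to hold distances between n points as doubles (8 bytes each).
dist_bytes <- function(n) 8 * n * (n - 1) / 2   # lower triangle only ("dist")
full_bytes <- function(n) 8 * n^2               # full n x n matrix

gib <- function(bytes) bytes / 2^30             # bytes -> GiB

n_small <- 1e4   # size used in this answer
n_big   <- 2e5   # size mentioned in the question

sizes <- data.frame(
  n        = c(n_small, n_big),
  dist_GiB = gib(dist_bytes(c(n_small, n_big))),
  full_GiB = gib(full_bytes(c(n_small, n_big)))
)
print(sizes)
```

With nr = 1e4 the full matrix is roughly 0.75 GiB, which fits in 8 GB of RAM. At 200000 rows the full matrix is about 298 GiB and the "dist" object alone about 149 GiB, so at that scale neither fits in memory and something disk-backed would be needed.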

Then, slowly, allocate and fill a "matrix":

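#as.matrix(d) #this gives an error on my machine
n = attr(d, "Size")
md = matrix(0, n, n)
#start index of each column's block in "d"; as.numeric avoids integer overflow for large n
id = cumsum(as.numeric(c(1L, (n - 1L) - 0:(n - 2L))))
for(j in 1:(n - 1L)) {
    i = (j + 1L):n #rows below the diagonal in column j
    md[i, j] = md[j, i] = d[id[j]:(id[j] + (n - (j + 1L)))] #fill both triangles
}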

(Allocating "md" as big.matrix(n, n, init = 0) seems to work equally well.)
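
As a sketch of how the same column-by-column fill might target a file-backed big.matrix, the loop can be wrapped in a function that only relies on `[<-`. The function name fill_from_dist, the variable names and the file names "md.bin" / "md.desc" are hypothetical. The example falls back to a base matrix when bigmemory is not installed, and I have only reasoned about it on tiny inputs, so treat it as a starting point rather than a tested solution.

```r
# Fill a preallocated n x n target from a "dist" object, one column at a time.
# `target` only needs to support `[<-`, so a base matrix or a big.matrix both work.
fill_from_dist <- function(d, target) {
  n <- attr(d, "Size")
  # start position of each column's block inside "d" (doubles avoid integer overflow)
  start <- cumsum(as.numeric(c(1, (n - 1):1)))
  for (j in seq_len(n - 1L)) {
    i <- (j + 1L):n                       # rows below the diagonal in column j
    x <- d[start[j] + seq_along(i) - 1]   # distances between j and each i
    target[i, j] <- x                     # lower triangle
    target[j, i] <- x                     # mirrored upper triangle
  }
  target
}

set.seed(1)
d_small <- dist(matrix(runif(50), 10, 5))
n_pts <- attr(d_small, "Size")

if (requireNamespace("bigmemory", quietly = TRUE)) {
  # file-backed, so the matrix lives on disk rather than in RAM
  md_small <- bigmemory::filebacked.big.matrix(
    n_pts, n_pts, type = "double", init = 0,
    backingfile = "md.bin", descriptorfile = "md.desc",
    backingpath = tempdir()
  )
} else {
  md_small <- matrix(0, n_pts, n_pts)  # fallback when bigmemory is not installed
}

md_small <- fill_from_dist(d_small, md_small)
all.equal(md_small[, ], as.matrix(d_small), check.attributes = FALSE)
```

Note that this still needs the "dist" object itself in memory, so for the sizes in the question the distances would have to be produced in chunks as well.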

md[2:5, 1]
#[1] 2.64625973 2.01071637 0.09207748 0.09346157
d[1:4]
#[1] 2.64625973 2.01071637 0.09207748 0.09346157
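
The check above works because a "dist" object stores the lower triangle column by column. If only a few entries are needed, their positions can be computed directly without building "md" at all. A small sketch follows: the helper name dist_pos is made up, and the formula is the one documented in ?dist.

```r
# Position of the (i, j) distance inside a "dist" vector of size n
# (formula from ?dist, valid for i < j <= n; arguments are swapped if needed).
dist_pos <- function(i, j, n) {
  lo <- pmin(i, j)
  hi <- pmax(i, j)
  n * (lo - 1) - lo * (lo - 1) / 2 + hi - lo
}

set.seed(42)
d_tiny <- dist(matrix(runif(40), 8, 5))
n_tiny <- attr(d_tiny, "Size")
full_tiny <- as.matrix(d_tiny)  # fine for a tiny example

# the looked-up value should match the corresponding full-matrix entry
d_tiny[dist_pos(5, 2, n_tiny)] == full_tiny[5, 2]
```

Passing i and j as doubles (as here) keeps the arithmetic in floating point, which matters once n * (n - 1) / 2 exceeds the integer range.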

Using smaller "nr" we could test:

all.equal(as.matrix(md), as.matrix(d), check.attributes = FALSE)
#[1] TRUE
alexis_laz
  • thanks a lot alex, that sounds feasible. Let me try it and I will come back with feedback :) – Akis Feb 19 '16 at 11:41