
Problem description

I have 45000 short time series (length 9) and would like to compute the distances for a cluster analysis. I realize that this will result in (the lower triangle of) a matrix of size 45000x45000, a matrix with more than 2 billion entries. Unsurprisingly, I get:

> proxy::dist(ctab2, method="euclidean")
Error: cannot allocate vector of size 7.6 Gb

What can I do?

Ideas

  • Increase available/addressable memory somehow? However, are these 7.6 Gb perhaps beyond some hard limit that cannot be extended? In any case, the system has 16 GB of memory and the same amount of swap. By "Gb", R seems to mean gigabyte, not gigabit, so 7.6 Gb already puts us dangerously close to a hard limit.
  • Perhaps a different distance computation method, say DTW instead of Euclidean, would be more memory-efficient? However, as explained below, the bottleneck seems to be the size of the resulting matrix, not the memory required during the computation itself.
  • Split the dataset into N chunks and compute the matrix in N^2 parts (actually only those parts relevant for the lower triangle) that can later be reassembled; a rough sketch of this chunk-wise approach follows after this list. (This might look similar to the solution to a similar problem proposed here.) It seems to be a rather messy solution, though, and I will need the 45K x 45K matrix in the end anyway. That, however, seems to hit the limit as well: the system also gives the memory allocation error when generating a 45K x 45K random matrix:

    > N=45000; memorytestmatrix <- matrix( rnorm(N*N,mean=0,sd=1), N, N)
    Error: cannot allocate vector of size 15.1 Gb

    30K x 30K matrices are possible without problems; R gives the resulting size as

    > print(object.size(memorytestmatrix), units="auto")
    6.7 Gb

    1 Gb more and everything would be fine, it seems. Sadly, I do not have any large objects that I could delete to make room. Also, ironically,

    > system('free -m')
    Warning message:
    In system("free -m") : system call failed: Cannot allocate memory

    I have to admit that I am not really sure why R refuses to allocate 7.6 Gb; the system certainly has more memory, although not a lot more. ps aux shows the R process as the single biggest memory user. Maybe there is an issue with how much memory R can address even if more is available?
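
A rough sketch of the chunk-wise idea mentioned above, in case it helps to see it spelled out. I have not run this on the full data; the chunk size and file names are only illustrative, and ctab2 is the data matrix from above.

library(proxy)      # proxy::dist also computes cross-distances between two matrices

chunk_size <- 5000
starts <- seq(1, nrow(ctab2), by = chunk_size)

for (i in seq_along(starts)) {
    rows_i <- starts[i]:min(starts[i] + chunk_size - 1, nrow(ctab2))
    for (j in seq_len(i)) {    # only the blocks on or below the diagonal
        rows_j <- starts[j]:min(starts[j] + chunk_size - 1, nrow(ctab2))
        if (i == j) {
            # diagonal block: ordinary dist object for this chunk
            block <- proxy::dist(ctab2[rows_i, , drop = FALSE], method = "euclidean")
        } else {
            # off-diagonal block: cross-distances between two chunks
            block <- proxy::dist(ctab2[rows_i, , drop = FALSE],
                                 ctab2[rows_j, , drop = FALSE],
                                 method = "euclidean")
        }
        # write each block to disk instead of keeping all of them in RAM
        saveRDS(block, sprintf("distblock_%03d_%03d.rds", i, j))
    }
}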

Related questions

  • Answers to other questions related to R running out of memory, like this one, suggest using more memory-efficient methods of computation.
  • This very helpful answer suggests deleting other large objects to make room for the memory-intensive operation.
  • Here, the idea of splitting the data set and computing distances chunk-wise is suggested.

Software & versions

The R version is 3.4.1. The system kernel is Linux 4.7.6, x86_64 (i.e. 64-bit).

> version
           _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          4.1                         
year           2017                        
month          06                          
day            30                          
svn rev        72865                       
language       R                           
version.string R version 3.4.1 (2017-06-30)
nickname       Single Candle 

Edit (Aug 27): Some more information

  • Updating the Linux kernel to 4.11.9 has no effect.
  • The bigmemory package may also run out of memory. It uses shared memory in /dev/shm/, which by default (depending on configuration) is limited to half the size of the RAM. You can increase this at runtime with, for instance, mount -o remount,size=12Gb /dev/shm, but this may still not allow the use of 12 Gb (I do not know why; perhaps the memory management configuration becomes inconsistent then). Also, you may end up crashing your system if you are not careful. A file-backed alternative is sketched after this list.
  • R apparently actually allows access to the full RAM and can create objects up to that size. It just seems to fail for particular functions such as dist. I will add this as an answer, but my conclusions are partly based on speculation, so I do not know to what degree this is right.
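
For completeness, a minimal sketch of a file-backed workaround for the /dev/shm limit mentioned above, assuming the bigmemory package; the file names are placeholders and I have not verified this for the full 45K case.

library(bigmemory)

# A file-backed big.matrix lives on disk rather than in /dev/shm, so it is
# not bound by the shared-memory limit. The backing file for a 45K x 45K
# double matrix will be roughly the 15.1 Gb reported above, but on disk.
N <- 45000
distBM <- big.matrix(nrow = N, ncol = N, type = "double",
                     backingfile = "distmatrix.bin",
                     descriptorfile = "distmatrix.desc",
                     backingpath = ".")

distBM[1, 2] <- 0.5   # elements can be read and written like a normal matrix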
0range

2 Answers


R apparently actually allows access to the full RAM. This works perfectly fine:

N=45000; memorytestmatrix <- matrix(nrow=N, ncol=N)

This is the same thing I tried before, as described in the original question, but with a matrix of NAs instead of rnorm random variates. Reassigning one of the values in the matrix as a float (memorytestmatrix[1,1] <- 0.5) still works and coerces the matrix to a numeric (double) matrix.

Consequently, I suppose, you can have a matrix of that size, but you cannot do it the way the dist function attempts to do it. A possible explanation is that the function operates with multiple objects of that size in order to speed the computation up. However, if you compute the distances element-wise and change the values in place, this works.

library(mefa)     # for the vec2dist function

euclidian <- function(series1, series2) {
    return((sum((series1 - series2)^2))^.5)
}

mx = nrow(ctab2)
distMatrixE <- vec2dist(0.0, size=mx)
for (coli in 1:(mx-1)) {
    for (rowi in (coli+1):mx) {
        # Element indices in dist objects count the rows down, column by column,
        # from left to right in a lower triangular matrix without the main diagonal.
        # From row and column indices, the element index for the dist object is computed like so:
        element <- (mx^2-mx)/2 - ((mx-coli+1)^2 - (mx-coli+1))/2 + rowi - coli
        # ... and now, we replace the distances in place
        distMatrixE[element] <- euclidian(ctab2[rowi,], ctab2[coli,])
    }
}

(Note that addressing in dist objects is a bit tricky, since they are not matrices but 1-dimensional vectors of size (N²-N)/2 recast as lower triangular matrices of size N x N. If we go through rows and columns in the right order, it could also be done with a counter variable, but computing the element index explicitly is clearer, I suppose.)

Also note that it may be possible to speed this up by using sapply or another vectorised approach to compute more than one value at a time.
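
One possible shape for that, vectorising the inner loop with rowSums rather than an explicit sapply call (an untested sketch, reusing ctab2 and the element-index formula from above):

library(mefa)     # for vec2dist, as above

mx <- nrow(ctab2)
distMatrixE <- vec2dist(0.0, size = mx)
for (coli in 1:(mx - 1)) {
    rows <- (coli + 1):mx
    # all distances of column 'coli' at once: squared differences of the
    # remaining rows against row 'coli', summed per row, then square-rooted
    diffs <- ctab2[rows, , drop = FALSE] -
        matrix(ctab2[coli, ], nrow = length(rows), ncol = ncol(ctab2), byrow = TRUE)
    d <- sqrt(rowSums(diffs^2))
    # the elements of column 'coli' are contiguous in the dist vector
    first <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + 1
    distMatrixE[first:(first + length(rows) - 1)] <- d
}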

0range

There exist good algorithms that do not need a full distance matrix in memory.

For example, SLINK, DBSCAN, and OPTICS.
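
A minimal sketch assuming the dbscan package (the package choice and the eps/minPts values are illustrative additions, not part of the answer above); both functions work directly on the data matrix, so the full 45000 x 45000 distance matrix is never built:

library(dbscan)

db  <- dbscan(ctab2, eps = 0.5, minPts = 5)   # DBSCAN clustering
opt <- optics(ctab2, minPts = 5)              # OPTICS reachability ordering
table(db$cluster)                             # cluster sizes (cluster 0 = noise)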

Has QUIT--Anony-Mousse