
I am running R 3.2.3 on a machine with 128 GB of RAM. I have a large matrix of 123028 rows x 168 columns. I would like to use a hierarchical clustering algorithm in R, so before I do that, I am trying to create a distance matrix with the vegdist() function in the vegan package, using the Bray-Curtis method. I get what looks like a memory allocation error:

df <- as.data.frame(matrix(rnorm(20668704), nrow = 123028))
library(vegan)
mydist <- vegdist(df)

Error in vegdist(df) : long vectors (argument 4) are not supported in .Fortran

If I use the pryr package to find out how much memory is needed for the distance matrix, I see that 121 GB are needed, which is less than the RAM that I have.

library(pryr)
mem_change(x <- 1:123028^2)

121 GB
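
For reference, the same figure follows from simple arithmetic for an n x n matrix of 8-byte doubles; the lines below are my own check, not part of the original output:

n <- 123028
n^2 * 8 / 1e9              # ~121 GB for a full n x n matrix of doubles
n * (n - 1) / 2 * 8 / 1e9  # ~61 GB if only the lower triangle is stored, as in a dist object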

I know there used to be a limit of 2 billion values for a single object in R, but I thought that limit disappeared in recent versions of R. Is there another memory limit I'm not aware of?

The bottom line is that I am wondering: What can I do about this error? Is it really because of memory limits or am I wrong about that? I would like to stay in R and use a clustering algorithm besides k-means, so I need to calculate a distance matrix.

jk22

1 Answer


R can handle long vectors just fine, but it seems that the distance matrix calculation is implemented in C or Fortran and interfaced with R using .C or .Fortran, which do not accept long vectors (i.e. vectors with length > 2^31 - 1) as arguments. See the docs here, which state:

Note that the .C and .Fortran interfaces do not accept long vectors, so .Call (or similar) has to be used.
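
To see why the distance calculation hits that limit, compare the long-vector threshold with the number of pairwise distances for 123028 rows (my own arithmetic, assuming a 64-bit R session):

.Machine$integer.max        # 2147483647, i.e. 2^31 - 1, the largest length .C/.Fortran accept
123028 * (123028 - 1) / 2   # ~7.57 billion pairwise distances, far beyond that limit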

Looking at the source code for the vegdist() function, it looks like your matrix is being converted into a vector and then passed to a function implemented in C to calculate the distances. The relevant lines of code:

d <- .C("veg_distance", x = as.double(x), nr = N, nc = ncol(x), 
        d = double(N * (N - 1)/2), diag = as.integer(FALSE), 
        method = as.integer(method), NAOK = na.rm, PACKAGE = "vegan")$d

And therein lies your problem. Argument 4 of that .C call is the output vector d = double(N * (N - 1)/2), which for your 123028 rows holds roughly 7.6 billion values, making it a long vector that .C will not accept. You will have to look for a different package to calculate your distance matrix (or implement one yourself).
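
As a rough illustration of the "implement one yourself" route, here is a minimal pure-R sketch of the Bray-Curtis distance between two rows. This is my own example rather than vegan code, it assumes non-negative abundance data, and computing all ~7.6 billion pairs this way would still require enormous memory and CPU time:

# Bray-Curtis distance between two non-negative vectors: sum(|a - b|) / sum(a + b)
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# quick check against vegdist() on a small non-negative matrix
m <- matrix(runif(4 * 168), nrow = 4)
bray_curtis(m[1, ], m[2, ])
as.matrix(vegan::vegdist(m))[1, 2]  # should match the value above (Bray-Curtis is the default method)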

ialm
  • Thanks! This did answer the question about why I'm getting the error. My next question is what can I do about it? I can try doing it piecewise as Alex suggested above, but are there any other existing functions that will calculate Bray-Curtis distance on a larger matrix? – jk22 Dec 15 '15 at 15:51
  • @jk22 As a first step, you should contact the vegan authors. They are very helpful in my experience, and may be able to find a solution that can be incorporated into future vegan releases. – Tyler Dec 15 '15 at 17:03
  • @jk22 Doing a quick google search yields the [ecodist](https://cran.r-project.org/web/packages/ecodist/) package that has the `bcdist()` function to compute Bray-Curtis distances. I'm not sure if it handles larger matrices or not. – ialm Dec 15 '15 at 17:15
  • Also, I would follow the suggestion made by @Tyler and contact the vegan authors about your problem so that the issue may get resolved. – ialm Dec 15 '15 at 17:18
  • I'm here (a vegan author). The analysis is correct: `vegdist()` does not handle long vectors. I have no plans to change this, because I don't have hardware where I could use long vectors (memory is exhausted first), but I'll welcome fixes (vegan is on GitHub). Function `designdist` is in pure R and may handle long vectors, but it uses more memory for temporary objects (3x more). Similarly `dist(x, "manhattan")/as.dist(outer(rowSums(x), rowSums(x), "+"))` is Bray-Curtis, but uses memory for temporary objects. In addition, anything you do will need huge amounts of CPU time: is it worth it all? – Jari Oksanen Dec 16 '15 at 10:18
  • Thanks to all for the ideas and information. @Jari Oksanen, I appreciate your response, and indeed, I am considering whether it is worth it to use such a large matrix. One alternative that is appealing is a k-means clustering with a large number of clusters, followed by a hierarchical clustering of the k clusters, similar to the suggestion [here](http://stackoverflow.com/questions/21984940/clustering-very-large-dataset-in-r/21990387#21990387). One hesitation with doing it that way is that I'm not sure that further analysis on a hierarchical clustering of the k clusters will be straightforward. – jk22 Dec 16 '15 at 20:00
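
The identity Jari Oksanen mentions in his comment can be verified on a small non-negative matrix. This check is my own illustration; as he notes, at full scale the outer() call alone creates another huge temporary matrix:

library(vegan)
m <- matrix(runif(5 * 168), nrow = 5)
bc1 <- vegdist(m)  # Bray-Curtis is vegdist()'s default method
bc2 <- dist(m, "manhattan") / as.dist(outer(rowSums(m), rowSums(m), "+"))
all.equal(as.numeric(bc1), as.numeric(bc2))  # TRUE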