
I'd like to run hierarchical clustering on a "large" matrix of dimensions 69878 x 10, but I can't: hclust in R requires the full matrix of pairwise distances up front, and computing it fails at these dimensions:

```
> str(x)
 num [1:69878, 1:10] 0 0 0 0 0 0 0 9 1 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:10] "0.5" "1" "1.5" "2" ...
> d <- dist(x)
Error: cannot allocate vector of size 18.2 Gb
```
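
For reference, 18.2 Gb is exactly the size of the full set of n*(n-1)/2 pairwise distances stored as doubles:

```r
# dist() keeps all n*(n-1)/2 pairwise distances as 8-byte doubles
n <- 69878
n * (n - 1) / 2 * 8 / 2^30   # ~18.2 GiB, matching the error message
```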

Is there a way to circumvent this limitation?

SkyWalker
  • Get more memory for your computer? Alternatively you can draw a sample from your data that will fit in your computer memory and cluster that. Decide how many clusters you want and compute the centroids for those clusters. Use the centroids with `kmeans()` on the full data set. Whether this makes sense depends on what you want to do with the results. – dcarlson Nov 20 '19 at 23:25 (a rough sketch of this approach follows the comments)
  • See https://stackoverflow.com/questions/53032431/is-it-possible-to-run-a-clustering-algorithm-with-chunked-distance-matrices and https://stackoverflow.com/questions/40989003/hclust-in-r-on-large-datasets – ThetaFC Nov 21 '19 at 00:25
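
A minimal sketch of the sample-then-kmeans approach suggested in the comments; the sample size of 5000, the choice of k = 20 clusters, and the Ward linkage are all illustrative assumptions rather than values from the question:

```r
set.seed(1)
k   <- 20                          # assumed number of clusters
idx <- sample(nrow(x), 5000)       # subsample small enough for dist()
h   <- hclust(dist(x[idx, ]), method = "ward.D2")
grp <- cutree(h, k = k)            # hierarchical cluster labels for the sample

# centroid of each hierarchical cluster, computed on the sample
cent <- aggregate(x[idx, ], by = list(grp), FUN = mean)[, -1]

# use those centroids as starting centers for k-means on the full matrix
fit <- kmeans(x, centers = as.matrix(cent), iter.max = 50)
table(fit$cluster)                 # cluster sizes over all 69878 rows
```

Because hclust only sees the 5000-row sample, its distance matrix shrinks from ~18 GiB to roughly 100 MB, while every row of the full matrix still ends up assigned to a cluster by `kmeans()`.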

0 Answers