I want to calculate a distance matrix using data.table so that I can run a hierarchical clustering algorithm (hclust) on it.

I have roughly 100k items, which causes me to run out of memory. Each item has an x value and a y value (easting and northing), so a Euclidean distance can be calculated between each pair of items. The main difference from previous questions (Calculate Euclidean distance matrix using a big.matrix object) is that each item here is made up of two elements, x and y, rather than the standard single element.

Below is a reproducible example of what I have tried using data.table, but I run into errors about allocating too large a vector, so I was wondering whether there are more memory-efficient ways of doing this.

Perhaps using Rcpp or bigmemory's big.matrix function? But would hclust even work on a big.matrix object?

Any help as always would be much appreciated.

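require(data.table)
require(dplyr)   # provides %>%
require(tidyr)   # provides spread()
x <- 100000

# random coordinates standing in for the real eastings and northings
tmp <- data.table(Easting=rnorm(x),Northing= rnorm(x))

# two copies of the points, each with its own id column
tmp1 <- copy(tmp)
tmp1[,id:=seq(nrow(tmp1))]
setnames(tmp1,c('Easting','Northing'),c('x1','x2'))
tmp2 <- copy(tmp)
tmp2[,id2:=seq(nrow(tmp2))]
setnames(tmp2,c('Easting','Northing'),c('y1','y2'))

# every (id, id2) pair, joined to the coordinates of both points
distDT <- CJ(id=seq(nrow(tmp)),
             id2=seq(nrow(tmp)))
distDT <- tmp2[tmp1[distDT,,on='id'],,on='id2']
# Euclidean distance for each pair
distDT[,d:=sqrt(((x1-y1)^2)+((x2-y2)^2))]
# reshape from long to wide, drop the id column, convert to a matrix
dist_mat <- distDT[,c('id','id2','d'),with=FALSE] %>% spread(id,d)
dist_mat <- dist_mat[,-c('id2'),with=FALSE]
dist_mat <- as.matrix(dist_mat)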
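
For comparison, here is a minimal sketch of the same computation in base R, assuming the coordinates fit in an ordinary numeric matrix. `dist()` takes the Euclidean distance over all columns, so the two-coordinate case needs no special handling, and `hclust()` accepts the resulting `dist` object directly. The sample size `n`, the seed, the linkage method and `k = 5` are illustrative assumptions, not values taken from the example above.

```r
# Sketch only: base R (stats::dist, stats::hclust, stats::cutree).
set.seed(1)
n <- 2000  # assumed small size so the example runs quickly

# One row per item, one column per coordinate.
coords <- cbind(Easting = rnorm(n), Northing = rnorm(n))

# Pairwise Euclidean distances; only the lower triangle is stored.
d <- dist(coords, method = "euclidean")

# Hierarchical clustering on the dist object, then cut into k groups.
hc <- hclust(d, method = "complete")
clusters <- cutree(hc, k = 5)  # k = 5 is an arbitrary illustrative choice
```

A `dist` object holds n(n-1)/2 values, so at 10k items it needs roughly 400 MB, while at 100k items it would need about 40 GB.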
h.l.m
  • You're trying to allocate a matrix that has ~5B non-`NA` elements and 5B `NA` elements. `base` R has a limit on `matrix` size of 2B elements... But realistically, I don't think you should be doing that in R period. `pryr::object_size(rnorm(10e8))` is 8GB--or 40GB for a vector of your non-missing elements. – alexwhitworth Jun 27 '16 at 23:21
  • Let's say x is 10k, the above code still doesn't work... any suggestions on how the process could be changed for it to work? – h.l.m Jun 27 '16 at 23:25
  • Possible duplicates: (1) [distance matrix using big.matrix](http://stackoverflow.com/q/26958646/903061), (2) [dist function with large number of points](http://stackoverflow.com/q/16190214/903061), (3) [big matrix and memory problems](http://stackoverflow.com/q/35486336/903061). – Gregor Thomas Jun 27 '16 at 23:26
  • I'm going to go ahead and close as dupe of the one with the Rcpp answer. If you feel your question has distinguishing features and should be reopened, please edit to highlight the differences. – Gregor Thomas Jun 27 '16 at 23:35
  • @Gregor the question has been adjusted to show the differences with other previous questions. The main difference being that it is using two values x and y to calculate distance not just one... – h.l.m Jun 28 '16 at 09:46
  • I'm pretty sure the linked dupe makes no 1-d assumption - it looks like it works for however many columns you have, including 2-dimensions. – Gregor Thomas Jun 28 '16 at 15:38
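
The memory figures discussed in the comments can be checked with a few lines of arithmetic. This is a rough sketch that assumes 8 bytes per double and ignores R's per-object overhead; the variable names are purely illustrative.

```r
n <- 1e5                 # number of items in the question
bytes_per_double <- 8    # assumption: distances stored as doubles

# Full n x n matrix, as built by the CJ() approach.
full_matrix_gb <- n^2 * bytes_per_double / 1e9

# Lower triangle only, which is what a dist object stores.
lower_tri_gb <- n * (n - 1) / 2 * bytes_per_double / 1e9

full_matrix_gb  # 80
lower_tri_gb    # 39.9996
```

Either way the result is far beyond typical RAM at 100k items, which is why the approaches in the linked questions avoid holding the whole matrix in memory.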

0 Answers