I am looking to calculate a distance matrix using data.table so that I can carry out a hierarchical clustering algorithm (hclust) on it.
I have roughly 100k items, which causes me to run out of memory. Each item has an x value and a y value (easting and northing), so a Euclidean distance can be calculated between any pair of items. The main difference from previous questions (Calculate Euclidean distance matrix using a big.matrix object) is that each item here is made up of two elements, i.e. the x and y coordinates, rather than a single element.
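To be concrete, the per-pair distance I need is just the standard two-coordinate Euclidean formula (a toy sketch, variable names mine):

```r
# Distance between two items given their easting/northing coordinates
p <- c(Easting = 0, Northing = 0)
q <- c(Easting = 3, Northing = 4)
d <- sqrt(sum((p - q)^2))  # 3-4-5 triangle, so this is 5
```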
Below is a reproducible example of what I have tried using data.table, but it fails when trying to allocate too large a vector, so I was wondering whether there are more memory-efficient methods of doing this.
Perhaps using Rcpp, or bigmemory's big.matrix? But would hclust even work on a big.matrix object?
Any help as always would be much appreciated.
require(data.table)
require(dplyr)  # for %>%
require(tidyr)  # spread() lives in tidyr, not dplyr

x <- 100000
tmp <- data.table(Easting = rnorm(x), Northing = rnorm(x))

# One copy of the points for each side of the pairwise join
tmp1 <- copy(tmp)
tmp1[, id := seq_len(nrow(tmp1))]
setnames(tmp1, c('Easting', 'Northing'), c('x1', 'x2'))

tmp2 <- copy(tmp)
tmp2[, id2 := seq_len(nrow(tmp2))]
setnames(tmp2, c('Easting', 'Northing'), c('y1', 'y2'))

# All id/id2 pairs: this is x^2 = 1e10 rows, which is where memory runs out
distDT <- CJ(id = seq_len(nrow(tmp)), id2 = seq_len(nrow(tmp)))
distDT <- tmp2[tmp1[distDT, on = 'id'], on = 'id2']
distDT[, d := sqrt((x1 - y1)^2 + (x2 - y2)^2)]

# Reshape the long pair table into a square distance matrix
dist_mat <- distDT[, c('id', 'id2', 'd'), with = FALSE] %>% spread(id, d)
dist_mat <- dist_mat[, -c('id2'), with = FALSE]
dist_mat <- as.matrix(dist_mat)
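As a sanity check that the join-and-reshape construction is at least correct, here is the same pipeline at a hypothetical small x of 100 (with set.seed added by me, and data.table's dcast standing in for tidyr::spread), compared against base R's dist():

```r
library(data.table)

x <- 100
set.seed(42)
tmp <- data.table(Easting = rnorm(x), Northing = rnorm(x))

# Same pairwise construction as above, at a size that fits in memory
distDT <- CJ(id = seq_len(x), id2 = seq_len(x))
distDT[, d := sqrt((tmp$Easting[id]  - tmp$Easting[id2])^2 +
                   (tmp$Northing[id] - tmp$Northing[id2])^2)]

# dcast performs the same long-to-wide reshape as spread() does
wide <- dcast(distDT, id2 ~ id, value.var = "d")
wide[, id2 := NULL]
dist_mat_small <- as.matrix(wide)

# dist() stores the lower triangle; expanding it should match
ref <- as.matrix(dist(as.matrix(tmp)))
stopifnot(isTRUE(all.equal(unname(dist_mat_small), unname(ref))))
```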