
I'm working with 39,000+ data points and computing the distance between each point and every other one. The result is a (39000+)^2 matrix that takes about 11GB, which I can't allocate in memory.

Fortunately, the dist function stores only the lower triangle, which reduces this to a little less than 6GB. But now I need to calculate the inverse squared distances (1 / d^2) and then normalize every row so that it sums to 1. This is necessary because I will later multiply every row of the matrix by a vector and store the result. So the big matrix is only a temporary object.
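As a rough check of these figures, here is the arithmetic, assuming 8-byte doubles and exactly 39,000 points (the real count is slightly higher):

```r
n <- 39000                 # approximate point count from the question
bytes_per_double <- 8

full_gib   <- n^2 * bytes_per_double / 2^30              # full n x n matrix
packed_gib <- n * (n - 1) / 2 * bytes_per_double / 2^30  # lower triangle only

round(c(full = full_gib, packed = packed_gib), 2)
# roughly 11.33 and 5.67 (GiB)
```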

My question is, how can I extract rows of this dist matrix?

A sample "dist" matrix obtained with dist(cbind(runif(5), runif(5))):

     1    2    3    4
2 0.47
3 0.63 0.72
4 0.79 0.62 0.37
5 0.53 0.15 0.62 0.48


What I'm looking for is a way to extract an entire row, for instance the first one:

0  0.47  0.63  0.79  0.53
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248

1 Answer


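You can reuse the function f from an old answer of mine. It maps the 2D index (i, j) of a below-diagonal entry (i > j) to its 1D position in the "dist" object.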

f <- function (i, j, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (i >= 1) & (j >= 1) & (i > j) & (i <= n) & (j <= n)
  k <- (2 * n - j) * (j - 1) / 2 + (i - j)
  k[!valid] <- NA_real_
  k
  }
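
To see what the index formula in f computes, here is a small standalone check. It relies on the documented layout of a "dist" object (the lower triangle stored by columns) and repeats the formula inline so that it runs on its own; the choice of entry (4, 2) is arbitrary.

```r
set.seed(0)
d <- dist(cbind(runif(5), runif(5)))
n <- attr(d, "Size")                      # 5 points

i <- 4; j <- 2                            # entry (4, 2) of the full matrix, i > j
k <- (2 * n - j) * (j - 1) / 2 + (i - j)  # same formula as in f
k                                         # 6: four entries of column 1, then (3, 2), (4, 2)
d[k] == as.matrix(d)[i, j]                # TRUE
```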

A helper function to extract a single row / column (a slice).

SliceExtract_dist <- function (dist_obj, k) {
  if (length(k) > 1) stop("The function is not 'vectorized'!")
  n <- attr(dist_obj, "Size")
  if (k < 1 || k > n) stop("k out of bound!")
  ## entries left of the diagonal in row k: (k, 1), ..., (k, k - 1)
  i <- seq_len(k - 1)
  j <- rep.int(k, k - 1)
  v1 <- dist_obj[f(j, i, dist_obj)]
  ## entries below the diagonal in column k: (k + 1, k), ..., (n, k)
  i <- k + seq_len(n - k)
  j <- rep.int(k, n - k)
  v2 <- dist_obj[f(i, j, dist_obj)]
  ## the diagonal entry is 0
  c(v1, 0, v2)
  }

Example

set.seed(0)
( d <- dist(cbind(runif(5),runif(5))) )
#          1         2         3         4
#2 0.9401067                              
#3 0.9095143 0.1162289                    
#4 0.5618382 0.3884722 0.3476762          
#5 0.4275871 0.6968296 0.6220650 0.3368478

SliceExtract_dist(d, 1)
#[1] 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871

SliceExtract_dist(d, 2)
#[1] 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296

SliceExtract_dist(d, 3)
#[1] 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650

SliceExtract_dist(d, 4)
#[1] 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478

SliceExtract_dist(d, 5)
#[1] 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000

Sanity check

as.matrix(d)
#          1         2         3         4         5
#1 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871
#2 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296
#3 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650
#4 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478
#5 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000

Note: A function to extract diagonals of a "dist" object already exists.
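
For the use case in the question, the rows can be processed one at a time so that the full matrix is never built. The following is only a sketch under some assumptions: "multiply every row by a vector" is taken to mean a weighted sum, the diagonal is skipped because 1 / 0^2 is infinite, and no two points coincide. The names dist_row, idw_apply and x are hypothetical, and dist_row is a compact standalone variant of the helper above so that the block runs on its own.

```r
# Hypothetical helper: row k of the full distance matrix, read directly
# from the packed lower triangle of a "dist" object.
dist_row <- function(dist_obj, k) {
  n <- attr(dist_obj, "Size")
  # 1D position of entry (i, j), i > j, in the column-major lower triangle
  pos <- function(i, j) (2 * n - j) * (j - 1) / 2 + (i - j)
  left  <- seq_len(k - 1)      # columns left of the diagonal: entries (k, j)
  below <- k + seq_len(n - k)  # rows below the diagonal: entries (i, k)
  c(dist_obj[pos(k, left)], 0, dist_obj[pos(below, k)])
}

# Inverse-distance-squared weighted value for every point, one row at a time.
idw_apply <- function(dist_obj, x) {
  n <- attr(dist_obj, "Size")
  vapply(seq_len(n), function(k) {
    w <- 1 / dist_row(dist_obj, k)[-k]^2  # weights to all other points
    sum(w / sum(w) * x[-k])               # normalized weights times values
  }, numeric(1))
}

set.seed(0)
d <- dist(cbind(runif(5), runif(5)))
x <- c(10, 20, 30, 40, 50)  # made-up values, for illustration only
res <- idw_apply(d, x)

# Dense comparison, feasible only for small n
m <- as.matrix(d)
W <- 1 / m^2
diag(W) <- 0
all.equal(res, as.vector((W / rowSums(W)) %*% x))
```

Only one row of length n is held in memory at a time, at the cost of recomputing the index vector for each row.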

Zheyuan Li
  • 1
    That seems to work amazingly, but a small correction is needed. In the `SliceExtract_dist` definition, it's necessary to change `d[]` to `dist_obj[]`. Thank you so much. – Felipe Moreira Jul 31 '19 at 16:36