
What is the fastest calculation for bipartite distance in R with a parallelized Rcpp backend?

parallelDist is a great package with a C++ backend and multi-threading support, but to my knowledge it does not support bipartite distance calculations.

One workaround is to use parallelDist() on the row-bound matrices and extract the bipartite block. This calculates the m1:m1 and m2:m2 distances in addition to m1:m2, which is highly inefficient.

library(parallelDist)

bipartiteDist <- function(matrix1, matrix2){
  # stack both matrices, compute the full distance matrix, then keep
  # only the matrix1-vs-matrix2 block
  matrix12 <- rbind(matrix1, matrix2)
  d <- as.matrix(parallelDist(matrix12))
  d[1:nrow(matrix1), (nrow(matrix1) + 1):nrow(matrix12)]
}

matrix1 <- abs(matrix(rnorm(10 * 100000), 10, 100000))
matrix2 <- abs(matrix(rnorm(10 * 100000), 10, 100000))

dist <- bipartiteDist(matrix1, matrix2)

This approach is faster than pdist or a pure R implementation when more than 3 cores are available.

pdist is perfect for computing bipartite distances, but does not support multithreading.
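
For reference, the pdist equivalent of the function above is a one-liner (Euclidean only, single-threaded); this assumes the matrix1 and matrix2 objects defined above:

library(pdist)
dist_pdist <- as.matrix(pdist(matrix1, matrix2))  # nrow(matrix1) x nrow(matrix2)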

Any fast implementations for parallelized bipartite distance computation?

zdebruine

1 Answer


The wordspace dist.matrix() function supports parallelized calculation of bipartite distances.

Benchmarking wordspace against parallelDist

matrix1 <- abs(matrix(rnorm(100 * 100000), 100, 100000))
matrix2 <- abs(matrix(rnorm(100 * 100000), 100, 100000))

library(rbenchmark)
library(parallelDist)
library(wordspace)

bipartiteDist_parallelDist <- function(matrix1, matrix2){
  matrix12 <- rbind(matrix1, matrix2)
  d <- as.matrix(parallelDist(matrix12, method = "euclidean"))
  # keep only the matrix1-vs-matrix2 block of the full distance matrix
  d[1:nrow(matrix1), (nrow(matrix1) + 1):nrow(matrix12)]
}

bipartiteDist_wordspace <- function(matrix1, matrix2){
  # use every available OpenMP thread
  wordspace.openmp(threads = wordspace.openmp()$max)
  dist.matrix(matrix1, matrix2, byrow = TRUE, method = "euclidean", convert = FALSE)
}

benchmark("parallelDist" = {
            bd1 <- bipartiteDist_parallelDist(matrix1,matrix2)
          },
          "wordspace" = {
            bd2 <- bipartiteDist_wordspace(matrix1,matrix2)
          },
          replications = 1,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))

plot(bd1,bd2) # yes, both methods give near-identical results

Benchmarking results:

          test replications elapsed relative user.self sys.self
1 parallelDist            1   2.120   12.184   126.145    0.523
2    wordspace            1   0.174    1.000     3.749    0.252

I used 80 threads.
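
If you want to see what your own machine supports before benchmarking, wordspace.openmp() called with no arguments reports the OpenMP thread settings; it is the same call that supplies the $max value used in bipartiteDist_wordspace() above:

library(wordspace)
wordspace.openmp()  # reports OpenMP thread settings, including $max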

A framework for further speed gains

The author of wordspace notes that the package emphasizes low memory footprint over raw speed, so additional speed gains are possible (source).

For instance, here is a general framework for Euclidean distance:

bipartiteDist3 <- function(matrix1, matrix2){
  # uses the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
  m1tm2 <- tcrossprod(matrix1, matrix2)   # cross-products between rows
  sq1 <- rowSums(matrix1^2)               # squared norms of matrix1 rows
  sq2 <- rowSums(matrix2^2)               # squared norms of matrix2 rows
  out0 <- outer(sq1, sq2, "+") - 2 * m1tm2
  sqrt(pmax(out0, 0))                     # clamp tiny negatives from floating-point error
}
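
As a quick sanity check (assuming the objects from the benchmark above are still in the workspace), this framework should agree with the wordspace result up to floating-point error:

bd3 <- bipartiteDist3(matrix1, matrix2)
max(abs(bd3 - bd2))  # should be on the order of machine precision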

I am very interested in a parallelized solution optimized for sparse matrices. To my knowledge, wordspace does not optimize for sparsity. For example, there are parallelizable sparse matrix implementations of tcrossprod, rowSums, and outer function equivalents.
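
As a rough, unbenchmarked sketch of what that could look like using the Matrix package's sparse-aware methods (the function name and density below are made up for illustration, and these particular methods are not themselves multi-threaded, so a parallel backend would still be needed):

library(Matrix)

bipartiteDist_sparse <- function(sm1, sm2){
  # same ||a||^2 + ||b||^2 - 2 * a.b identity, with sparse inputs
  m1tm2 <- tcrossprod(sm1, sm2)            # sparse cross-product of rows
  sq1 <- rowSums(sm1^2)                    # squared row norms (dense vectors)
  sq2 <- rowSums(sm2^2)
  out0 <- outer(sq1, sq2, "+") - 2 * as.matrix(m1tm2)
  sqrt(pmax(out0, 0))                      # clamp tiny negatives from floating-point error
}

sm1 <- rsparsematrix(100, 100000, density = 0.01)
sm2 <- rsparsematrix(100, 100000, density = 0.01)
dist_sparse <- bipartiteDist_sparse(sm1, sm2)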

zdebruine
  • Some good work has gone into this question and your own answer. It does seem like a lot of work to come up with a great answer to your question however. Have you considered turning to GPU calculations using libraries like `RcppArrayFire`? It's been a while since it has been updated, but I believe it is still very much functional for matrix algebra on the GPU. It wouldn't utilize sparse calculations, but it might be sufficient for your problem, while still being simple to implement. – Oliver Oct 30 '20 at 19:03
  • @Oliver Thanks for that. I certainly see the allure of GPU here, but it wouldn't be ideal due to the CPU-focused nature of other algorithms in my pipeline. Wordspace has met my present needs excellently, and I suspect it's the fastest accessible option for most users with similar needs. But it's always important to be aware of potential improvements. – zdebruine Oct 31 '20 at 02:43