1

I am using energy::dcor to calculate the distance correlation between pair of vectors.

However, I now need to do it with a lot of columns (~2600), and my code is taking about 1 hour or so to complete.

I understand this might be normal, but I wanted to check whether there was a better and more efficient way to do it (e.g. using data.table, avoiding for loops, etc.).

My code:

N <- 100
K <- 200
foo <- matrix(runif(N*K), nrow = N, ncol = K)
colnames(foo) <- colnames(foo, do.NULL = FALSE, prefix = "col")
result <- matrix(NA, nrow = ncol(foo), ncol = ncol(foo))
dimnames(result) <- list(colnames(foo), colnames(foo))

for (i in 1:ncol(foo)) {
  other.cols <- setdiff(1:ncol(foo), i)
  for (j in other.cols) {
    X <- na.omit(foo[, c(i, j)])
    r <- energy::dcor(X[, 1], X[, 2])
    result[i,j] <- r
  }
}

Doing some profiling I discovered that X <- na.omit(foo[, c(i, j)]) is the line taking most of the time to compute, but I don't know how to improve it. Basically that is needed because dcor does not want NA in the dataset.

umbe1987
  • 2,894
  • 6
  • 35
  • 63
  • 1
    As the resulting matrix is symmetric you can calculate the correlations only for half of the result matrix (upper or lower triangular part). Reducing the time in half. – minem Aug 01 '23 at 20:22

1 Answers1

0

Just wanted to post an improvement following @minem suggestion.

I made sure not to calculate the correlation for the same pairs of columns twice. Since the matrix is symmetric, I only calculate the upper triangle and then mirror it in the lower triangle along the diagonal (using this solution).

updated code

N <- 100
K <- 200
foo <- matrix(runif(N*K), nrow = N, ncol = K)
colnames(foo) <- colnames(foo, do.NULL = FALSE, prefix = "col")
result <- matrix(NA, nrow = ncol(foo), ncol = ncol(foo))
dimnames(result) <- list(colnames(foo), colnames(foo))
processed <- rep(NA, K) # THIS IS NEW

for (i in 1:ncol(foo)) {
  processed <-processed[i] # THIS IS NEW
  other.cols <- setdiff(1:ncol(foo), i)
  for (j in other.cols) {
    if (j %in% processed) next # THIS IS NEW
    X <- na.omit(foo[, c(i, j)])
    r <- energy::dcor(X[, 1], X[, 2])
    result[i,j] <- r
  }
  result[lower.tri(result)] <- t(result)[lower.tri(result)] # THIS IS NEW
}
umbe1987
  • 2,894
  • 6
  • 35
  • 63