applying distance correlation (or another funciton) to each pair of columns of very large dataset efficiently

Question

I am using energy::dcor to calculate the distance correlation between pair of vectors.

However, I now need to do it with a lot of columns (~2600), and my code is taking about 1 hour or so to complete.

I understand this might be normal, but I wanted to check whether there was a better and more efficient way to do it (e.g. using data.table, avoiding for loops, etc.).

My code:

N <- 100
K <- 200
foo <- matrix(runif(N*K), nrow = N, ncol = K)
colnames(foo) <- colnames(foo, do.NULL = FALSE, prefix = "col")
result <- matrix(NA, nrow = ncol(foo), ncol = ncol(foo))
dimnames(result) <- list(colnames(foo), colnames(foo))

for (i in 1:ncol(foo)) {
  other.cols <- setdiff(1:ncol(foo), i)
  for (j in other.cols) {
    X <- na.omit(foo[, c(i, j)])
    r <- energy::dcor(X[, 1], X[, 2])
    result[i,j] <- r
  }
}

Doing some profiling I discovered that X <- na.omit(foo[, c(i, j)]) is the line taking most of the time to compute, but I don't know how to improve it. Basically that is needed because dcor does not want NA in the dataset.

As the resulting matrix is symmetric you can calculate the correlations only for half of the result matrix (upper or lower triangular part). Reducing the time in half. — minem, Aug 01 '23 at 20:22

umbe1987 · Answer 1 · 2023-08-11T21:26:58.293

Just wanted to post an improvement following @minem suggestion.

I made sure not to calculate the correlation for the same pairs of columns twice. Since the matrix is symmetric, I only calculate the upper triangle and then mirror it in the lower triangle along the diagonal (using this solution).

updated code

N <- 100
K <- 200
foo <- matrix(runif(N*K), nrow = N, ncol = K)
colnames(foo) <- colnames(foo, do.NULL = FALSE, prefix = "col")
result <- matrix(NA, nrow = ncol(foo), ncol = ncol(foo))
dimnames(result) <- list(colnames(foo), colnames(foo))
processed <- rep(NA, K) # THIS IS NEW

for (i in 1:ncol(foo)) {
  processed <-processed[i] # THIS IS NEW
  other.cols <- setdiff(1:ncol(foo), i)
  for (j in other.cols) {
    if (j %in% processed) next # THIS IS NEW
    X <- na.omit(foo[, c(i, j)])
    r <- energy::dcor(X[, 1], X[, 2])
    result[i,j] <- r
  }
  result[lower.tri(result)] <- t(result)[lower.tri(result)] # THIS IS NEW
}

applying distance correlation (or another funciton) to each pair of columns of very large dataset efficiently

1 Answers1