I am using energy::dcor to calculate the distance correlation between pairs of columns.
However, I now need to do this for many columns (~2600), and my code takes about an hour to complete.
I understand this might be normal, but I wanted to check whether there is a better, more efficient way to do it (e.g. using data.table, avoiding for loops, etc.).
My code:
N <- 100
K <- 200
foo <- matrix(runif(N * K), nrow = N, ncol = K)
colnames(foo) <- colnames(foo, do.NULL = FALSE, prefix = "col")

result <- matrix(NA, nrow = ncol(foo), ncol = ncol(foo))
dimnames(result) <- list(colnames(foo), colnames(foo))

for (i in 1:ncol(foo)) {
  other.cols <- setdiff(1:ncol(foo), i)
  for (j in other.cols) {
    X <- na.omit(foo[, c(i, j)])
    r <- energy::dcor(X[, 1], X[, 2])
    result[i, j] <- r
  }
}
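One thing worth noting: distance correlation is symmetric (dcor(x, y) equals dcor(y, x)), so the double loop above computes every value twice. A minimal sketch that fills only the upper triangle and mirrors it, roughly halving the runtime (this assumes the same foo and energy package as above; it is an illustration, not a drop-in replacement if your real code needs something different):

```r
# Sketch: compute each unordered pair once, then mirror across the diagonal.
K <- ncol(foo)
result <- matrix(NA_real_, K, K,
                 dimnames = list(colnames(foo), colnames(foo)))

for (i in 1:(K - 1)) {
  for (j in (i + 1):K) {
    X <- na.omit(foo[, c(i, j)])          # pairwise-complete rows for (i, j)
    result[i, j] <- energy::dcor(X[, 1], X[, 2])
    result[j, i] <- result[i, j]          # mirror: dcor is symmetric
  }
}
```

Since each pair is independent, the (i, j) pairs could also be distributed across cores with parallel::mclapply (or parallel::parLapply on Windows) for a further speed-up.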
Doing some profiling, I discovered that the line X <- na.omit(foo[, c(i, j)]) takes most of the time to compute, but I don't know how to improve it. Basically it is needed because dcor does not accept NA values in the data.
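If the profiling result holds up on the real data, one option is to avoid building and scanning a two-column copy on every iteration: compute the NA mask once up front, then subset with a logical index per pair. A sketch under that assumption (foo and result as defined above):

```r
# Precompute, once, which entries are non-NA.
ok <- !is.na(foo)                          # N x K logical matrix

for (i in 1:ncol(foo)) {
  for (j in setdiff(1:ncol(foo), i)) {
    keep <- ok[, i] & ok[, j]              # pairwise-complete rows for (i, j)
    result[i, j] <- energy::dcor(foo[keep, i], foo[keep, j])
  }
}
```

And if the matrix has no NAs at all (as in the runif example here), the na.omit step can simply be dropped, since the columns can be passed to dcor directly.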