
Given a data frame or matrix with an arbitrary number of rows and columns, what is the fastest way to apply a function to all pairwise combinations of columns?

For example, if I have a data table:

library(data.table)

N <- 3
K <- 3
data <- data.table(id = seq(N))
for (k in seq(K)) {
    set(data, j = paste0("v", k), value = runif(N))  # add columns v1..vK by reference
}

And if I want to compute the simple difference between all pairs of columns, I could loop (or lapply) over columns:

differences <- data.table(foo = seq(N))  # placeholder column to set the row count
for (var1 in names(data)) {
    for (var2 in names(data)) {
        if (var1 == var2) next
        # keep each unordered pair only once
        if (which(names(data) == var1) > which(names(data) == var2)) next
        combo <- paste0(var1, var2)
        differences[[combo]] <- data[[var1]] - data[[var2]]
    }
}
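
The lapply flavor of the same idea, for reference (a sketch, with combn supplying the name pairs):

pairs <- combn(names(data), 2, simplify = FALSE)            # all unordered name pairs
diffs <- lapply(pairs, function(p) data[[p[1]]] - data[[p[2]]])
names(diffs) <- vapply(pairs, paste0, character(1), collapse = "")
differences <- as.data.table(diffs)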

But as K gets larger, this becomes absurdly slow.

One solution I've considered is to make two new data tables using combn and subtract them:

pairs <- combn(colnames(data), 2)   # 2 x choose(K, 2) matrix of column-name pairs
a <- data[, pairs[1, ], with = FALSE]
b <- data[, pairs[2, ], with = FALSE]
differences <- a - b

But as N and K get larger, this becomes very memory intensive (though faster than looping).

It seems to me that the outer product of the matrix with itself is probably the best way to go, but I can't piece it together. This is especially hard if I want to apply an arbitrary function (RMSE, for example) instead of just the difference.
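
The closest I can get is for RMSE specifically, via the identity sum((x - y)^2) = sum(x^2) + sum(y^2) - 2*sum(x*y), which crossprod yields for all column pairs at once. A sketch, assuming every column of data is numeric:

m  <- as.matrix(data)                        # N x K, all-numeric
cp <- crossprod(m)                           # K x K matrix of column dot products
sq <- diag(cp)                               # per-column sums of squares
d2 <- pmax(outer(sq, sq, "+") - 2 * cp, 0)   # sum((x_i - x_j)^2), guarding round-off
rmse <- sqrt(d2 / nrow(m))                   # K x K matrix of pairwise RMSEs

But that trick is specific to squared differences; I don't see how to generalize it to an arbitrary function.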

What's the fastest way?

dmp
  • Please take time to make your example reproducible. There is no `differences` data.table defined. –  Jan 13 '16 at 02:08
  • "All pairwise combinations" is an n^2 problem. There is no way around it. If N is large, I would go for nested loops, like what you have above. Maybe you are not interested in the case where `var1==var2`, or maybe you know how the results for `i,j` and `j,i` are related. This may give you a 2x speed boost. If N is small, I would also use a matrix and numeric column indices (indexing by name is slower). But if N is large, that does not matter. – Ott Toomet Jan 13 '16 at 03:15
  • `data.m <- data.u <- melt(data); DAT <- CJ(data.m$value, data.m$value)[, diff := V1 - V2]` provides all the differences (and quickly), but working out which difference goes where requires additional steps. What are you using these differences for? – Hugh Jan 13 '16 at 04:52
  • Thanks, I wasn't aware of `CJ()`. For my current application, I have a data frame with >20 thousand columns, some of which are nearly identical, but not exactly. There's no pattern to which ones are nearly identical, and there can be more than 1 duplicate per column. I want to ultimately compute RMSE or something to weed out the duplicates. – dmp Jan 13 '16 at 18:49
  • Are these columns time-series? – Alexander Radev Jan 13 '16 at 19:23
  • Yes they are time series. – dmp Jan 13 '16 at 19:26

1 Answer


If your data starts out as a matrix, you can melt it into long format and do the following:

library(data.table)

data <- matrix(runif(300 * 500), nrow = 300, ncol = 500)

# Melt the matrix into long format (one row per cell), keyed by column id.
# c(data) flattens column-major, so rowId lines up within each colId.
data.DT <- setkey(
  data.table(c(data), colId = rep(1:500, each = 300), rowId = rep(1:300, times = 500)),
  colId
)

# For each column, subtract it from every higher-numbered column;
# vv (length 300) recycles over the rows of each matched column.
diff.DT <- data.DT[
  , {
    ccl <- unique(colId)
    vv <- V1
    data.DT[colId > ccl, .(col2 = colId, diff = V1 - vv)]
  }
  , keyby = .(col1 = colId)
]
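
A quick spot-check of one pair, using the diff column named above (each entry is the higher-numbered column minus the lower-numbered one):

all.equal(diff.DT[col1 == 1 & col2 == 2, diff], data[, 2] - data[, 1])  # should be TRUE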
Alexander Radev
  • Interesting, thanks. Benchmarking your solution against my second solution above indicates that it's slightly slower when N and K are small, but ~15x faster as they get large. – dmp Jan 13 '16 at 18:47
  • Follow-up question: when K and N get _really_ large (in the thousands), I get the "negative length vectors are not allowed" error from data.table. I think this means I'm reaching the RAM limit of the object, right? Do you happen to know how to avoid it? – dmp Jan 14 '16 at 20:14
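
For what it's worth, "negative length vectors are not allowed" usually means a single vector in the result would exceed R's 2^31 - 1 element limit, rather than RAM as such. A hedged workaround, sketched below with an arbitrary block size: materialize the pairwise table in blocks of col1 and reduce each block (e.g. to one RMSE per pair) or write it out before building the next.

block <- 50
for (start in seq(1, 500, by = block)) {
    ids <- start:min(start + block - 1, 500)
    chunk <- data.DT[colId %in% ids, {
        ccl <- unique(colId)
        vv <- V1
        data.DT[colId > ccl, .(col2 = colId, diff = V1 - vv)]
    }, keyby = .(col1 = colId)]
    # ...reduce or save `chunk` here before the next block overwrites it...
}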