
I have a matrix of size 2000 x 700. I want to subtract all possible pairs of rows. If x_i represents a row, then I want to compute: x_1 - x_2, x_1 - x_3, ..., x_2 - x_3, ...

Example:

mat 
1 2 3
5 3 2
1 1 6

My output should be

x_1 - x_2: -4 -1  1
x_1 - x_3:  0  1 -3
x_2 - x_3:  4  2 -4

I tried using a loop, but it takes too long. Is there an efficient way to compute this?
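
For reference, here is roughly the kind of double loop I tried (a sketch; my actual code may have differed):

n <- nrow(mat)
res <- matrix(0, nrow = n * (n - 1) / 2, ncol = ncol(mat))
k <- 1
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    res[k, ] <- mat[i, ] - mat[j, ] # one pairwise difference per iteration
    k <- k + 1
  }
}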

ari6739
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Dec 20 '22 at 16:17
  • Those are not all possible pairs of rows though; only half. How do you define the pairs you want and the order they should be in? (e.g. you seem to want row1 - row2 but not row2 - row1) – Ottie Dec 20 '22 at 17:43

3 Answers


Perhaps use `combn` on the row indices, subtracting one pair of rows at a time:

combn(seq_len(nrow(mat)), 2, function(x) mat[x[1], ] - mat[x[2], ])
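
Each pairwise difference comes back as a column, so transposing gives the layout from the question. A minimal check on the 3 x 3 example, assuming mat is a plain numeric matrix:

mat <- rbind(c(1, 2, 3), c(5, 3, 2), c(1, 1, 6))

# t() turns combn's one-column-per-pair result into one row per pair
t(combn(seq_len(nrow(mat)), 2, function(x) mat[x[1], ] - mat[x[2], ]))
#      [,1] [,2] [,3]
# [1,]   -4   -1    1
# [2,]    0    1   -3
# [3,]    4    2   -4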
akrun
  • It works, but it's still too slow for my data size; it would take a few hours that way. – ari6739 Dec 20 '22 at 16:43
  • @ari6739 `combn` is faster than expand.grid, expand_grid or outer, as those would do twice the number of comparisons – akrun Dec 20 '22 at 16:44
  • @ari6739 you may try [this](https://stackoverflow.com/questions/26828301/faster-version-of-combn) faster option – akrun Dec 20 '22 at 16:46

A relatively fast approach is to pre-define an index matrix of row pairs and then use it on the data, converted to a data.table. The whole operation should finish in under a minute for a 2000 x 700 matrix.

library(data.table)

setDT(mat) # convert the data.frame to a data.table by reference

rows <- nrow(mat)

# build all index pairs (i, j) with i < j, one pair per row
idx <- as.matrix(rbindlist(lapply(1:(rows - 1), function(x) 
  rbindlist(lapply((x + 1):rows, function(y) list(x, y)))))) # takes approx. 6 secs on my crappy system for all pairs of 2000 rows
idx
     V1 V2
[1,]  1  2
[2,]  1  3
[3,]  2  3

mat[idx[, 1], ] - mat[idx[, 2], ] # takes approx. 12 secs for 700 columns, see below if there's a memory error "Error: vector memory exhausted (limit reached?)"
   V1 V2 V3
1: -4 -1  1
2:  0  1 -3
3:  4  2 -4

If the data is very wide, the subtraction may not fit into memory, since the vectorized operation materializes the whole result at once. A solution is to split the operation into smaller chunks by cycling through the indices, e.g.

rbindlist(apply(
  # chunk boundaries: split the rows of idx into up to 9 roughly equal blocks
  cbind(unique(floor(c(1, seq(1, nrow(idx), length.out=10)[2:9] + 1))), 
        unique(floor(seq(1, nrow(idx), length.out=10)[2:10]))), 1, function(x)
  # subtract one block of row pairs at a time, then stack the results
  mat[idx[x[1]:x[2], 1],] - mat[idx[x[1]:x[2], 2],]))
         V1 V2 V3
      1: -4 -1  1
      2:  0  1 -3
      3:  0  0  0
      4: -4 -1  1
      5:  0  1 -3
     ---         
1998996:  4  1 -1
1998997:  0  0  0
1998998:  0 -1  3
1998999: -4 -2  4
1999000: -4 -1  1

Data

mat <- structure(list(V1 = c(1L, 5L, 1L), V2 = c(2L, 3L, 1L), V3 = c(3L, 
2L, 6L)), class = "data.frame", row.names = c(NA, -3L))
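
To test the timings at full size, a random input of the question's dimensions can be generated (a sketch; any 2000 x 700 numeric data.frame will do, and note that setDT() expects a data.frame rather than a matrix):

set.seed(1)
# hypothetical full-size input: 2000 rows, 700 columns of small integers
mat <- as.data.frame(matrix(sample(1:10, 2000 * 700, replace = TRUE), nrow = 2000))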
Andre Wildberg

Another use of combn, along with asplit, which splits the matrix into a list of row vectors so that do.call(`-`, x) subtracts each pair directly:

> t(combn(asplit(mat, 1), 2, function(x) do.call(`-`, x)))
     [,1] [,2] [,3]
[1,]   -4   -1    1
[2,]    0    1   -3
[3,]    4    2   -4
ThomasIsCoding