4

This is a direkt expansion of this Question. I have a dataset and I want to find all pairwise combinations of Variable v depending on Variables x and y:

library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,1,6), v=1:18)
        x y  v
     1: a 1  1
     2: a 1  2
     3: a 6  3
     4: a 1  4
     5: a 1  5
     6: a 6  6
     7: b 1  7
     8: b 1  8
     9: b 6  9
    10: b 1 10
    11: b 1 11
    12: b 6 12
    13: c 1 13
    14: c 1 14
    15: c 6 15
    16: c 1 16
    17: c 1 17
    18: c 6 18

DT[, list(new1 = t(combn(sort(v), m = 2))[,1], 
   new2 = t(combn(sort(v), m = 2))[,2]), 
   by = list(x, y)]
        x y new1 new2
     1: a 1    1    2
     2: a 1    1    4
     3: a 1    1    5
     4: a 1    2    4
     5: a 1    2    5
     6: a 1    4    5
     7: a 6    3    6
     8: b 1    7    8
     9: b 1    7   10
    10: b 1    7   11
    11: b 1    8   10
    12: b 1    8   11
    13: b 1   10   11
    14: b 6    9   12
    15: c 1   13   14
    16: c 1   13   16
    17: c 1   13   17
    18: c 1   14   16
    19: c 1   14   17
    20: c 1   16   17
    21: c 6   15   18

The Code does what I want but the twice function call makes it slow for larger dataset. My dataset has more than 3 million rows and more than 1.3 million combinations of x and y. Any suggestions on how to do this faster? I would prefer something like:

DT[, list(c("new1", "new2") = t(combn(sort(v), m = 2))), by = list(x, y)]
Community
  • 1
  • 1
fc9.30
  • 2,293
  • 20
  • 19
  • I see two things that could speed up this process 1. Convert the dataframe into the matrix. 2. Pre allocate memory to 'new2' before the process I think overall you will find this very useful http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r – Pork Chop Mar 13 '14 at 10:31
  • 3
    @pops except that it is a `data.table` which is not the same as a `data.frame` (but it does inherit from that class). – Simon O'Hanlon Mar 13 '14 at 10:33
  • Your `v` column in the example data does not correspond to the code you posted. Which is correct? – Simon O'Hanlon Mar 13 '14 at 10:34
  • How many unique values are there in `x` and in `y` separately? Together they come about 1.3m, you've said already. – Arun Mar 13 '14 at 12:06
  • 1 mio. unique x values and 400T unique y values – fc9.30 Mar 13 '14 at 13:21

2 Answers2

5

This should work:

DT[, {
    tmp <- combn(sort(v), m = 2 )
    list(new1 = tmp[1,], new2 = tmp[2,] )
  }
, by = list(x, y) ]
Beasterfield
  • 7,023
  • 2
  • 38
  • 47
  • This solution is not as nice as the suggestion of @shadow but it is more than 16 times faster (for my specific dataset) ... – fc9.30 Mar 14 '14 at 08:54
4

The following also works. The trick is to convert the matrix into a data.table.

DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]

If necessary, just rename the columns after

r2 <- DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]
setnames(r2, c("V1", "V2"), c("new1", "new2"))
shadow
  • 21,823
  • 4
  • 63
  • 77
  • IF v is a character variable you need an aditional data.frame() because data.table converts the character into a factor: `r2 <- DT[, data.table(data.frame(t(combn(sort(v), m = 2)), stringsAsFactors = F)), by=list(x, y)]` – fc9.30 Mar 13 '14 at 13:24