0

I am trying to efficiently merge with a condition.

The way I am doing it now is to cross-join (which I want to preserve) except I have one condition for a subset of the columns.

Cross join function (from another question post):

CJ.table.1 <- function(X,Y)
      setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
    
set.seed(1)
#generate data
x = data.table(t=rep(1:10,2), z=sample(1:10,20,replace=T))
x2 = data.table(tprime=rep(1:10,2), zprime=sample(1:10,20,replace=T))

joined = CJ.table.1(x,x2)
> joined
      t  z tprime zprime
  1:  1  3      1     10
  2:  2  4      1     10
  3:  3  6      1     10
  4:  4 10      1     10
  5:  5  3      1     10
 ---                    
396:  6  5     10      5
397:  7  8     10      5
398:  8 10     10      5
399:  9  4     10      5
400: 10  8     10      5

Then I want to make sure t is increasing by 1 only.

setcolorder(joined, c("t", "tprime", "z",'zprime'))
joined=joined[tprime==t+1]

The final desired output is then:

> joined
    t tprime  z zprime
 1: 1      2  3      3
 2: 1      2  3      3
 3: 2      3  4      7
 4: 2      3  2      7
 5: 3      4  6      2
 6: 3      4  7      2
 7: 4      5 10      3
 8: 4      5  4      3
 9: 5      6  3      4
10: 5      6  8      4
11: 6      7  9      1
12: 6      7  5      1
13: 7      8 10      4
14: 7      8  8      4
15: 8      9  7      9
16: 8      9 10      9
17: 9     10  7      4
18: 9     10  4      4
19: 1      2  3      6
20: 1      2  3      6
21: 2      3  4      5
22: 2      3  2      5
23: 3      4  6      2
24: 3      4  7      2
25: 4      5 10      9
26: 4      5  4      9
27: 5      6  3      7
28: 5      6  8      7
29: 6      7  9      8
30: 6      7  5      8
31: 7      8 10      2
32: 7      8  8      2
33: 8      9  7      8
34: 8      9 10      8
35: 9     10  7      5
36: 9     10  4      5
    t tprime  z zprime

The reason I want to condition BEFORE the cross join is that the actual data I have is huge and therefore, it is inefficient to generate the entire thing first and THEN prune it down.

The reason I can't just do a merge is that I need to cross join the other rows as well.

philipxy
  • 14,867
  • 6
  • 39
  • 83
wolfsatthedoor
  • 7,163
  • 18
  • 46
  • 90
  • I’ve had similar issues with data.table and it would be very helpful to know the answer to this – svenkatesh Mar 11 '19 at 23:42
  • 1
    An inner join is just a cross join followed by selection on a condition. Just write the inner join that is on the condition you want. (Cross join is inner join on true.) – philipxy Mar 12 '19 at 00:28
  • 3
    for e.g. `x[, k:=t+1][x2[, k:=tprime], on=.(k), nomatch=0L][, .(t, tprime, z, zprime)]` – chinsoon12 Mar 12 '19 at 00:30
  • 1
    [1.6 What are the benefits of being able to use column names as if they are variables inside DT\[...\]?](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.html#what-are-the-benefits-of-being-able-to-use-column-names-as-if-they-are-variables-inside-dt...) "inside the square brackets, data.table sees this query as a whole before any part of it is evaluated. Thus it can optimize the combined query for performance. It can do this because the R language uniquely has lazy evaluation" – philipxy Mar 12 '19 at 00:57
  • @chinsoon12, I think you need allowcartesian =T for this. Also, what about it you have multiple conditions? Particularly, suppose you have the t condition, but also a condition that says zprime must be some function of z (e.g. it must be less than the square) – wolfsatthedoor Mar 12 '19 at 08:27
  • 1
    you can add `allowcartesian=TRUE` whenever u get that error message. to add more condition, use something along the line of `x[, c('k', 'y1') := .(t+1, z)][x2[, c('k','y2') := .(tprime, myfun(zprime)], on=.(k=k, y1 – chinsoon12 Mar 12 '19 at 08:35
  • @chinsoon12, interesting. Suppose you have y1 – wolfsatthedoor Mar 12 '19 at 09:43
  • on works with character string as well. just don't leave any space e.g. `on=c("y1 – chinsoon12 Mar 12 '19 at 10:27
  • @chinsoon12 if you post the comment as an answer I am happy to award best answer. Additionally, I asked the multiple conditions inner join here: https://stackoverflow.com/questions/55132230/inner-join-with-multiple-conditions-r-data-table – wolfsatthedoor Mar 13 '19 at 03:41
  • For large joins, I am getting this error: Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table... I have enough memory for sure, but is there any way around? Using R 3.5 – wolfsatthedoor Jun 21 '22 at 01:28
  • @chinsoon12 any advice on the above comment? – wolfsatthedoor Jun 21 '22 at 02:22
  • thats a huge number of rows. r u sure about it? – chinsoon12 Jun 21 '22 at 07:44
  • @chinsoon12 I am sure about it being correct yes – wolfsatthedoor Jun 21 '22 at 12:57
  • @chinsoon12 I have more than enough memory as well. – wolfsatthedoor Jun 21 '22 at 13:18
  • Any idea for a workaround if datatable won't let me @chinsoon12 ? Reproducible example in this question: https://stackoverflow.com/questions/72695800/error-for-cross-join-cartesian-merge-in-data-table-any-solution – wolfsatthedoor Jun 21 '22 at 19:36
  • @chinsoon12 started a bounty here https://stackoverflow.com/questions/61091313/are-data-tables-with-more-than-231-rows-supported-in-r-with-the-data-table-pack – wolfsatthedoor Jun 21 '22 at 20:47

1 Answers1

0

You could create tprime = t+1 in x and join on it:

x[,tprime:=t+1]
x[x2,on=.(tprime),nomatch=0]

    t  z tprime zprime
 1: 1  6      2      6
 2: 1  4      2      6
 3: 2  2      3      1
 4: 2  8      3      1
 5: 3  8      4      4
 6: 3  2      4      4
 7: 4  3      5      4
 8: 4  1      5      4
 9: 5  6      6      6
10: 5  4      6      6
11: 6  6      7      2
12: 6  4      7      2
13: 7  9      8      3
14: 7  4      8      3
15: 8  7      9      1
16: 8  6      9      1
17: 9 10     10      1
18: 9 10     10      1
19: 1  6      2      5
20: 1  4      2      5
21: 2  2      3      4
22: 2  8      3      4
23: 3  8      4     10
24: 3  2      4     10
25: 4  3      5      9
26: 4  1      5      9
27: 5  6      6      4
28: 5  4      6      4
29: 6  6      7      6
30: 6  4      7      6
31: 7  9      8     10
32: 7  4      8     10
33: 8  7      9     10
34: 8  6      9     10
35: 9 10     10      6
36: 9 10     10      6
    t  z tprime zprime

or without modifying x:

x[,.(t,z,tprime=t+1)][
  x2,on=.(tprime),nomatch=0]

According to microbenchmark first solution seems a bit faster:

Unit: milliseconds
                                                                    expr    min      lq
 {     x[, `:=`(tprime, t + 1)]     x[x2, on = .(tprime), nomatch = 0] } 1.1060 1.21755
           x[, .(t, z, tprime = t + 1)][x2, on = .(tprime), nomatch = 0] 1.2967 1.43045
Unit: milliseconds
                                                                    expr    min     lq
 {     x[, `:=`(tprime, t + 1)]     x[x2, on = .(tprime), nomatch = 0] } 1.0921 1.1508
           x[, .(t, z, tprime = t + 1)][x2, on = .(tprime), nomatch = 0] 1.2688 1.3373
     mean  median      uq    max neval
 1.348345 1.22910 1.42265 2.8227   100
 1.521294 1.40295 1.59990 2.9437   100
Waldi
  • 39,242
  • 6
  • 30
  • 78