
I'm trying to replace a Cartesian product produced by SQL with a data.table call. I have a large history of assets and values, and I need a subset of all combinations. Let's say that I have a table T = [date, contract, value]. In SQL it looks like

SELECT a.date, a.contract, a.value, b.contract, b.value
FROM T a, T b
WHERE a.date = b.date AND a.contract <> b.contract AND a.value + b.value < 4

In R, I currently have the following:

library(data.table)

n <- 1500
dt <- data.table(date     = rep(seq(Sys.Date() - n+1, Sys.Date(), by = "1 day"), 3),
                 contract = c(rep("a", n), rep("b", n), rep("c", n)),
                 value    = c(rep(1, n), rep(2, n), rep(3, n)))
setkey(dt, date)

dt[dt, allow.cartesian = TRUE][(contract != i.contract) & (value + i.value < 4)]

I believe that my solution creates all combinations first (in this case 13,500 rows) and only then filters (down to 3,000). SQL, however (and I might be wrong), joins and filters in a single step and, more importantly, doesn't load all combinations into RAM. Any ideas on how to use data.table more efficiently?

kismsu
2 Answers


Use the by = .EACHI feature. In data.table, joins and subsets are very closely linked; i.e., a join is just another subset, using a data.table instead of the usual integer / logical / row-name indices. They were designed this way with these cases in mind.

Subset-based joins make it possible to combine j-expressions and grouping operations while joining, as the sketch below illustrates.
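
To illustrate the idea (a minimal sketch; the table DT and its columns id and v are made up for illustration), the following three calls select the same rows:

library(data.table)
DT <- data.table(id = c("a", "b", "c"), v = 1:3, key = "id")

DT[2:3]                          # subset by integer row indices
DT[c(FALSE, TRUE, TRUE)]         # subset by a logical vector
DT[data.table(id = c("b", "c"))] # "subset" by a join on DT's key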

require(data.table)
dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow.cartesian = TRUE]

This is the idiomatic way (in case you'd like to use the i.* columns just for the condition, but not return them as well); however, .SD has not yet been optimised, and evaluating the j-expression on .SD for each group is costly.

system.time(dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow.cartesian = TRUE])
#    user  system elapsed 
#   2.874   0.020   2.983 

Some cases using .SD have already been optimised. Until this case is taken care of as well, you can work around it like this:

dt[dt, {
        # logical index of the x rows in this group that satisfy both conditions
        idx = contract != i.contract & value + i.value < 4L
        list(contract = contract[idx],
             value = value[idx],
             # i.contract / i.value are scalars within each group:
             # subsetting by any(idx) keeps the scalar (recycled to the
             # group's result length) when there are matches, and returns
             # a zero-length vector otherwise
             i.contract = i.contract[any(idx)],
             i.value = i.value[any(idx)]
        )
       }, by = .EACHI, allow.cartesian = TRUE]

And this takes 0.045 seconds, as opposed to 0.005 seconds with your method. But by = .EACHI evaluates the j-expression for each group (and is therefore memory efficient). That's the trade-off you'll have to accept.
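
As a quick sanity check (a sketch; fsetequal() from data.table compares two tables as sets of rows, ignoring row order), both approaches should return the same rows:

res_cartesian <- dt[dt, allow.cartesian = TRUE][
  (contract != i.contract) & (value + i.value < 4)]
res_eachi <- dt[dt, {
        idx = contract != i.contract & value + i.value < 4L
        list(contract = contract[idx],
             value = value[idx],
             i.contract = i.contract[any(idx)],
             i.value = i.value[any(idx)])
       }, by = .EACHI, allow.cartesian = TRUE]

fsetequal(res_cartesian, res_eachi)   # should print TRUE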

Arun

Since version 1.9.8 (on CRAN 25 Nov 2016), non-equi joins are possible with data.table, which can be utilized here.

In addition, the OP's approach creates "symmetric duplicates", i.e., both (a, b) and (b, a). Avoiding these duplicates would halve the size of the result set without loss of information (compare ?combn).
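
For example (combn() is a base R function that enumerates each unordered pair exactly once):

t(combn(c("a", "b", "c"), 2))
#      [,1] [,2]
# [1,] "a"  "b"
# [2,] "a"  "c"
# [3,] "b"  "c"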

If this is the intention of the OP, we can use a non-equi join to avoid those symmetric duplicates:

library(data.table)
dt[, rn := .I][dt, on = .(date, rn < rn), nomatch = 0L][value + i.value < 4]

which gives

            date contract value   rn i.contract i.value
   1: 2013-09-24        a     1 1501          b       2
   2: 2013-09-25        a     1 1502          b       2
   3: 2013-09-26        a     1 1503          b       2
   4: 2013-09-27        a     1 1504          b       2
   5: 2013-09-28        a     1 1505          b       2
  ---                                                  
1496: 2017-10-28        a     1 2996          b       2
1497: 2017-10-29        a     1 2997          b       2
1498: 2017-10-30        a     1 2998          b       2
1499: 2017-10-31        a     1 2999          b       2
1500: 2017-11-01        a     1 3000          b       2

as opposed to the result using OP's code

            date contract value i.contract i.value
   1: 2013-09-24        b     2          a       1
   2: 2013-09-24        a     1          b       2
   3: 2013-09-25        b     2          a       1
   4: 2013-09-25        a     1          b       2
   5: 2013-09-26        b     2          a       1
  ---                                             
2996: 2017-10-30        a     1          b       2
2997: 2017-10-31        b     2          a       1
2998: 2017-10-31        a     1          b       2
2999: 2017-11-01        b     2          a       1
3000: 2017-11-01        a     1          b       2

The next step is to further reduce the number of pairs created which need to be filtered out afterwards:

dt[, val4 := 4 - value][dt, on = .(date, rn < rn, val4 > value), nomatch = 0L]

which returns the same result as above.

Note that the filter condition value + i.value < 4 has been replaced by the additional join condition val4 > value, where val4 is a helper column created for this purpose: value + i.value < 4 is equivalent to 4 - value > i.value, i.e., val4 > i.value.

Benchmark

For a benchmark case of n <- 150000L, resulting in 450 k rows in dt, the timings are:

n <- 150000L
dt <- data.table(date     = rep(seq(Sys.Date() - n+1, Sys.Date(), by = "1 day"), 3),
                 contract = c(rep("a", n), rep("b", n), rep("c", n)),
                 value    = c(rep(1, n), rep(2, n), rep(3, n)))

dt0 <- copy(dt)
microbenchmark::microbenchmark(
  OP = {
    dt <- copy(dt0)
    dt[dt, on = .(date), allow.cartesian = TRUE][
      (contract != i.contract) & (value + i.value < 4)]
  },
  nej1 = {
    dt <- copy(dt0)
    dt[, rn := .I][dt, on = .(date, rn < rn), nomatch = 0L][value + i.value < 4]
  },
  nej2 = {
    dt <- copy(dt0)
    dt[, rn := .I][, val4 := 4 - value][dt, on = .(date, rn < rn, val4 > value), nomatch = 0L]
  },
  times = 20L
)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
   OP 136.3091 143.1656 246.7349 298.8648 304.8166 311.1141    20   b
 nej1 127.9487 133.1772 160.8096 136.0825 146.0947 298.3348    20  a 
 nej2 180.4189 183.9264 219.5171 185.9385 198.7846 351.3038    20   b

So, doing the check value + i.value < 4 after the join seems to be faster than including it in the non-equi join.
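
Note that both non-equi variants add their helper columns to dt by reference; if the original table is needed unchanged afterwards, they can be dropped again (a small cleanup sketch):

# rn and val4 were added to dt by reference; remove them
# to restore the original set of columns
dt[, c("rn", "val4") := NULL]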

Uwe