6

In answering this question about rolling joins with the data.table package, I've run into some odd behavior when using multiple conditions.

Considering the following datasets:

dt <- data.table(t_id = c(1,4,2,3,5), place = c("a","a","d","a","d"), num = c(5.1, 5.1, 6.2, 5.1, 6.2), key=c("place"))
dt_lu <- data.table(f_id = c(rep(1,4),rep(2,3)), place = c("a","b","c","d","a","d","a"), num = c(6,7,8,9,6,7,8), key=c("place"))

When I want to join dt with dt_lu where only those cases of dt_lu that have the same place and where dt_lu$num is higher than dt$num as follows:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num],
               fid = f_id),
      by = .EACHI]

I get the desired result:

    place tid tnum fnum fid
 1:     a   1  5.1    6   1
 2:     a   1  5.1    6   2
 3:     a   1  5.1    8   2
 4:     a   4  5.1    6   1
 5:     a   4  5.1    6   2
 6:     a   4  5.1    8   2
 7:     a   3  5.1    6   1
 8:     a   3  5.1    6   2
 9:     a   3  5.1    8   2
10:     d   2  6.2    9   1
11:     d   2  6.2    7   2
12:     d   5  6.2    9   1
13:     d   5  6.2    7   2

When I want to add an additional condition, I can get the desired result easily by chaining that additional conditions as follows:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num],
               fid = f_id),
      by = .EACHI][fnum - tnum < 2]

which gives me:

   place tid tnum fnum fid
1:     a   1  5.1    6   1
2:     a   1  5.1    6   2
3:     a   4  5.1    6   1
4:     a   4  5.1    6   2
5:     a   3  5.1    6   1
6:     a   3  5.1    6   2
7:     d   2  6.2    7   2
8:     d   5  6.2    7   2

However when I add the extra condition (i.e.: the difference has to be less than 2) as follows:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num & num - i.num < 2],
               fid = f_id),
      by = .EACHI]

I do not get the expected result:

    place tid tnum fnum fid
 1:     a   1  5.1    6   1
 2:     a   1  5.1    6   2
 3:     a   1  5.1    6   2
 4:     a   4  5.1    6   1
 5:     a   4  5.1    6   2
 6:     a   4  5.1    6   2
 7:     a   3  5.1    6   1
 8:     a   3  5.1    6   2
 9:     a   3  5.1    6   2
10:     d   2  6.2    7   1
11:     d   2  6.2    7   2
12:     d   5  6.2    7   1
13:     d   5  6.2    7   2

Moreover, I get the following warning message:

Warning message: In [.data.table(dt_lu, dt, list(tid = i.t_id, tnum = i.num, fnum = num[i.num < : Column 3 of result for group 1 is length 2 but the longest column in this result is 3. Recycled leaving remainder of 1 items. This warning is once only for the first group with this issue.

The expected result would be:

    place tid tnum fnum fid
 1:     a   1  5.1    6   1
 2:     a   1  5.1    6   2
 4:     a   4  5.1    6   1
 5:     a   4  5.1    6   2
 7:     a   3  5.1    6   1
 8:     a   3  5.1    6   2
11:     d   2  6.2    7   2
13:     d   5  6.2    7   2

I deliberately kept the rownumbers from the first example to show which rows have to be maintained in the final result (which is the same as the working solution).

As this answer shows, it should be possible to use multiple conditions inside the join operation.

I tried the following alternatives, but both of them don't work:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[(i.num < num) & (num - i.num < 2)],
               fid = f_id),
      by = .EACHI]

dt_lu[dt, {
  val = num[(i.num < num) & (num - i.num < 2)];
  list(tid = i.t_id,
       tnum = i.num,
       fnum = val,
       fid = f_id)},
  by = .EACHI]

Could someone explain me why I don't get the desired result with multiple conditions inside the join operation?

Community
  • 1
  • 1
Jaap
  • 81,064
  • 34
  • 182
  • 193

1 Answers1

8

The warning message gives away the issue. Also, using print() is quite helpful here.

dt_lu[dt, print(i.num < num & num - i.num < 2), by=.EACHI]
# [1]  TRUE  TRUE FALSE
# [1]  TRUE  TRUE FALSE
# [1]  TRUE  TRUE FALSE
# [1] FALSE  TRUE
# [1] FALSE  TRUE
# Empty data.table (0 rows) of 3 cols: place,place,num

Consider the first case where the condition evaluates to TRUE, TRUE, FALSE. There are 3 observations for this group. And your j-expression contains:

.(tid = i.t_id,
  tnum = i.num,
  fnum = num[i.num < num & num - i.num < 2],
  fid = f_id)

i.t_id and i.num are of length 1 (as they come from dt). But num[..condn..] will return length = 2, whereas f_id will return length = 3. Both the length=1 and length=2 items will be recycled to the length of the longest item/vector = 3. That leads to incorrect result. Since 3 isn't perfectly divisible by 2, it returns the warning.

What you intend to do is:

.(tid = i.t_id,
  tnum = i.num,
  fnum = num[i.num < num & num - i.num < 2],
  fid = f_id[i.num < num & num - i.num < 2])

or equivalently:

{  
  idx = i.num < num & num - i.num < 2
  .(tid  = i.t_id, tnum = i.num, fnum = num[idx], fid  = f_id[idx])
}

Putting it together:

dt_lu[dt, 
       {
         idx = i.num < num & num - i.num < 2
        .(tid  = i.t_id, tnum = i.num, fnum = num[idx], fid  = f_id[idx])
       }, 
by = .EACHI]
#    place tid tnum fnum fid
# 1:     a   1  5.1    6   1
# 2:     a   1  5.1    6   2
# 3:     a   4  5.1    6   1
# 4:     a   4  5.1    6   2
# 5:     a   3  5.1    6   1
# 6:     a   3  5.1    6   2
# 7:     d   2  6.2    7   2
# 8:     d   5  6.2    7   2
Arun
  • 116,683
  • 26
  • 284
  • 387
  • Thanx! If interpret your anwer well, I was lucky when using only one condition because the number of observations is the same for both `fnum` and `fid`. – Jaap Aug 16 '15 at 12:33