5

I've found an odd behavior when running a non-equi join (from R's data.table library) and I can't figure out why this is happening.

Why is it that, when running a non-equi join, if I want to preserve the original value of the left table, I need to write x.colname instead of just colname inside the j attribute of the join?

Here's a small reproducible example of what I'm talking about:

library(tidyverse)
library(data.table)

# Setting seed for reproducibility
set.seed(666)

# data.table that contains roadway segments.
# The "frm_dfo" and "to_dfo" columns represent the start and end mileposts 
# of each roadway segment. For example, the segment with road_ID=101 refers 
# to the portion of IH20 that starts at milepost 10 and ends at milepost 20.
roads = data.table(road_id=101:109,
                   hwy=c('IH20','IH20','IH20','SH150','SH150','SH150','TX66','TX66','TX66'),
                   frm_dfo=c(10,20,30,10,20,30,10,20,30),
                   to_dfo=c(20,30,40,20,30,40,20,30,40),
                   seg_name=c('Seg 1','Seg 2', 'Seg 3','Seg 10','Seg 20', 'Seg 30','Seg 100','Seg 200', 'Seg 300'))

# data.table that contains crashes. 
# The "dfo" column represents the milepost of the roadway on which the 
# crash occurs. For example, the crash with crash_id=1 happens on milepost 33.23105 of IH20.
crashes = data.table(crash_id=1:30,
                     hwy=rep(c('IH20','SH150','BOB11'),each=10),
                     dfo=runif(min=10,max=40, n=30))

# Non-equi join that finds which segment each crash happens on.
joined_data_v1 = crashes %>%
                  .[roads, 
                    j  = list(crash_id, hwy, x.dfo, seg_name, frm_dfo, to_dfo),
                    on = list(hwy=hwy, dfo >= frm_dfo, dfo <= to_dfo)] %>%
                  arrange(crash_id, by_group = TRUE)

# Again, joining crashes and roadway segments. 
# Here, though, note that I've swapped x.dfo for just dfo inside the `j` argument 
joined_data_v2 = crashes %>%
                  .[roads, 
                    j  = list(crash_id, hwy, dfo, seg_name, frm_dfo, to_dfo),
                    on = list(hwy=hwy, dfo >= frm_dfo, dfo <= to_dfo)] %>%
                  arrange(crash_id, by_group = TRUE)

Here is a snapshot of joined_data_v1 (using x.dfo in the j argument): joined_data_v1

And here is a snapshot of joined_data_v2 (using dfo in the j argument): joined_data_v2

Note how, in joined_data_v1, the column called x.dfo contains the exact values from the dfo column from the crashes data.table. However, in joined_data_v2, the column called dfo contains the values from the frm_dfo column from the roads data.table (instead of the actual data from the crashes data.table's dfo column).

What is going on here? Why does this behave so oddly? Why do the values contained in the dfo/x.dfo column of the resulting data.table not always reflect exactly what was contained in the original dfo column from the crashes data.table?

I tried looking into some of the documentation for the non-equi join, but couldn't really find anything that could help me here.

Here is a related related question, but they don't mention why this behavior happens.

Felipe D.
  • 1,157
  • 9
  • 19
  • 2
    I share the curiosity, but I fear this will be mostly about *opinion*, unless the originating authors choose to weigh in with their original design decisions. This has been brought up before; despite many of the authors being active on SO, they appear to have not weighed in, either due to inattention or not wishing to get into the discussion. My *guess* is that something had to be chosen, and there were several not-unreasonable alternatives, and they chose one at the time. Perhaps they regret, but it is really hard to change that default behavior without breaking *a lot of depending packages*. – r2evans Jul 13 '21 at 17:23
  • Thanks for the insight! Yeah, sounds like they had a hard decision to make. On a related note: do you happen to know what exactly that `x.` prefix is supposed to mean/do? Or where I can find more information about when it's needed or where else it is useful? I wasn't able to find much when I searched package's docs. Thx again! – Felipe D. Jul 13 '21 at 17:30
  • 2
    Nope, no idea about the `x.` prefix, sorry. Often my exploits in non-trivial `[`-joins (non-equi or otherwise) require iteration through many similar attempts to find what I want; this includes names and values of non-equi columns. – r2evans Jul 13 '21 at 17:41
  • No worries. Still, thanks for the helpful info!!! =) – Felipe D. Jul 13 '21 at 17:55
  • 1
    I think the `.x` comes from the merging nomentlature: your are subsetting a data.table `x` with an other data.table, by placing it in the `i` part of the `x[i,j,by]` structure. So the calling `.i` refers to columns in the second one, and `.x` in the first one. You can do an update join using `.i`: https://stackoverflow.com/questions/44433451/r-data-table-update-join – denis Jul 13 '21 at 20:26
  • 1
    you might want to check out https://github.com/Rdatatable/data.table/issues/1700 . And you dont need non-equi join to get this behaviour `a <- data.table(x=4:5); b <- data.table(y=1); a[b, on=.(x=y), .(x, y)]; a[b, on=.(x=y), .(x.x, i.y)]` my rudimentary understanding is that j will use values from i when they are also keys used in the join. – chinsoon12 Jul 13 '21 at 23:35
  • 2
    A bit more explanations : https://www.r-bloggers.com/2021/02/the-unequalled-joy-of-non-equi-joins/ and https://stackoverflow.com/a/44343424/13513328 – Waldi Jul 14 '21 at 06:48
  • 1
    I think this is the canonical PR where the background of your issue is discussed: [SQL-like column return for non-equi and rolling joins](https://github.com/Rdatatable/data.table/pull/3093). (The PR is marked as closed, but I reckon it's "just" because it's transferred to [Both columns for rolling and non-equi joins](https://github.com/Rdatatable/data.table/pull/3093)) – Henrik Jul 14 '21 at 20:12

0 Answers0