I've found an odd behavior when running a non-equi join (from R's data.table library) and I can't figure out why this is happening.
Why is it that, when running a non-equi join, if I want to preserve the original value of the left table, I need to write x.colname
instead of just colname
inside the j
attribute of the join?
Here's a small reproducible example of what I'm talking about:
library(tidyverse)
library(data.table)
# Setting seed for reproducibility
set.seed(666)
# data.table that contains roadway segments.
# The "frm_dfo" and "to_dfo" columns represent the start and end mileposts
# of each roadway segment. For example, the segment with road_ID=101 refers
# to the portion of IH20 that starts at milepost 10 and ends at milepost 20.
roads = data.table(road_id=101:109,
hwy=c('IH20','IH20','IH20','SH150','SH150','SH150','TX66','TX66','TX66'),
frm_dfo=c(10,20,30,10,20,30,10,20,30),
to_dfo=c(20,30,40,20,30,40,20,30,40),
seg_name=c('Seg 1','Seg 2', 'Seg 3','Seg 10','Seg 20', 'Seg 30','Seg 100','Seg 200', 'Seg 300'))
# data.table that contains crashes.
# The "dfo" column represents the milepost of the roadway on which the
# crash occurs. For example, the crash with crash_id=1 happens on milepost 33.23105 of IH20.
crashes = data.table(crash_id=1:30,
hwy=rep(c('IH20','SH150','BOB11'),each=10),
dfo=runif(min=10,max=40, n=30))
# Non-equi join that finds which segment each crash happens on.
joined_data_v1 = crashes %>%
.[roads,
j = list(crash_id, hwy, x.dfo, seg_name, frm_dfo, to_dfo),
on = list(hwy=hwy, dfo >= frm_dfo, dfo <= to_dfo)] %>%
arrange(crash_id, by_group = TRUE)
# Again, joining crashes and roadway segments.
# Here, though, note that I've swapped x.dfo for just dfo inside the `j` argument
joined_data_v2 = crashes %>%
.[roads,
j = list(crash_id, hwy, dfo, seg_name, frm_dfo, to_dfo),
on = list(hwy=hwy, dfo >= frm_dfo, dfo <= to_dfo)] %>%
arrange(crash_id, by_group = TRUE)
Here is a snapshot of joined_data_v1
(using x.dfo
in the j
argument):
And here is a snapshot of joined_data_v2
(using dfo
in the j
argument):
Note how, in joined_data_v1
, the column called x.dfo
contains the exact values from the dfo
column from the crashes
data.table. However, in joined_data_v2
, the column called dfo
contains the values from the frm_dfo
column from the roads
data.table (instead of the actual data from the crashes
data.table's dfo
column).
What is going on here? Why does this behave so oddly? Why do the values contained in the dfo/x.dfo
column of the resulting data.table not always reflect exactly what was contained in the original dfo
column from the crashes
data.table?
I tried looking into some of the documentation for the non-equi join, but couldn't really find anything that could help me here.
Here is a related related question, but they don't mention why this behavior happens.