5

I am working with data.table and I want to do a non-equi left join/merge.

I have one table with car prices and another table to identify which car class each car belongs to:

data_priceclass <- data.table()
data_priceclass$price_from <- c(0, 0, 200000, 250000, 300000, 350000, 425000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000, 1300000, 1400000, 1500000, 1600000, 1700000, 1800000) 
data_priceclass$price_to <- c(199999, 199999, 249999, 299999, 349999, 424999, 499999, 599999, 699999, 799999, 899999, 999999, 1099999, 1199999, 1299999, 1399999, 1499999, 1599999, 1699999, 1799999, 1899999)
data_priceclass$price_class <- c(1:20, 99)

I use a non-equi join to merge the two tables. But the x[y]-join syntax of data.table removes duplicates.

cars <- data.table(car_price = c(190000, 500000))
cars[data_priceclass, on = c("car_price >= price_from", 
                             "car_price < price_to"),
     price_class := i.price_class,]
cars

Notice that the car with value 190000 is supposed to get matches on two rows in the data_priceclass table, but since x[y] removes duplicates, I can't see this in the output. Normally when I join I always use the merge function instead of x[y], because I'm losing control when I use x[y].

But the following does not work with non-equi joins:

merge(cars, data_priceclass,
      by = c("car_price >= price_from", 
             "car_price < price_to"),
      all.x = T , all.y = F)

Any tips how I can do a non-equi join with data.table that does not remove duplicates?

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Helen
  • 533
  • 12
  • 37
  • x[y] does not remove duplicates. It’s because you are using := , it performs a lookup into table x. – chinsoon12 Apr 24 '21 at 00:42
  • @chinsoon12 in effect it kinda does, but I agree with your point – Helen Apr 24 '21 at 02:49
  • Doesn’t solve my problem though – Helen Apr 24 '21 at 02:49
  • Try data_priceclass[cars, on=.(car_price>=price_from, car_price – chinsoon12 Apr 24 '21 at 12:13
  • @chinsoon12 that gives me an error message – Helen May 03 '21 at 09:53
  • 2
    I think what @chinsoon12 meant is try `data_priceclass[cars, on = .(price_from <= car_price, price_to > car_price)]`. The reason that you didn't get duplicates in your attempt is because `car` data has only two rows, while you are trying to assign 3 rows using `:=`. Hence, you need to increase the size of `car` and you can't `:=`. The reason for using `data_priceclass[cars]` instead of `cars[data_priceclass]` is because `X[Y]` it means *"for each value in Y find all the matching records in X"* In your case you want to to find all the values in `data_priceclass` for each value in `cars`. – David Arenburg May 03 '21 at 10:29
  • @DavidArenburg Thank you for the explanation! Yes, I figured that is what they were getting at, but I'm specifically looking for a left-join here, and that is maybe not what they were giving me (although that would work in this specific example)... unless I'm misunderstanding, of course... – Helen May 03 '21 at 10:33
  • I think it will work in any case. Try it out for other cases and see. – David Arenburg May 03 '21 at 10:52
  • @DavidArenburg It will not work in the case where you have no matches, so I don't think I can use this solution. For example, if you set `cars <- data.table(car_price = c(1900000, 5000000))`, it won't work – Helen May 03 '21 at 11:07

1 Answers1

5

As noted in comments, a left join on cars is done by using cars as subsetting condition i in the DT[i,j,by] syntax.
This puts cars on the right, which might be counter-intuitive compared to SQL, and I found this tutorial useful to compare both syntaxes.

cars <- data.table(car_price = c(190000, 500000))
data_priceclass[cars, .(car_price,x.price_from,x.price_to,price_class),on = .(price_from <= car_price,price_to > car_price)]

   car_price x.price_from x.price_to price_class
1:    190000        0e+00     199999           1
2:    190000        0e+00     199999           2
3:    500000        5e+05     599999           8

If you increase car price:

cars <- cars * 10
data_priceclass[cars, .(car_price,x.price_from,x.price_to,price_class),on = .(price_from <= car_price,price_to > car_price)]

   car_price x.price_from x.price_to price_class
1:   1900000           NA         NA          NA
2:   5000000           NA         NA          NA
Waldi
  • 39,242
  • 6
  • 30
  • 78
  • Thank you for answer! I think I might have misunderstood what a left join in data.table is. From the documentation (cran.r-project.org/web/packages/data.table/data.table.pdf) it says: `X[DT, on="x"] # left join`. From what @David Arenburg writes above: `X[Y] it means "for each value in Y find all the matching records in X"`. I'm coming from SQL where whatever is to the left of the `left join`, .e.g. `A left join B on...` is what is interpreted as `for each value in A find all the matching records in B`, but here it seems to be the opposite... I'm very confused right now, could you clarify? – Helen May 03 '21 at 12:16
  • I found this [tutorial](https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html) useful to compare `data.table` to `SQL` syntax – Waldi May 03 '21 at 12:23
  • You're right, X[Y] is taking Y as left table in SQL, which might be confusing, but `data.table`never said `left`. In `data.table` syntax it means `merge(X, Y, all.x=TRUE)` – Waldi May 03 '21 at 12:26
  • data.table said `left` in the documentation (link above), but it is confusing what object in `X[DT]` that `left` is referring to. Now I'm confused. You're saying that `X[Y]` is taking Y as a left table, but `merge(X, Y, all.x=TRUE)` is taking `X` as the left table. Did you mean `all.y=TRUE`? – Helen May 03 '21 at 12:35
  • Sorry for adding confusion ;-) I meant `all.y=true` – Waldi May 03 '21 at 12:38
  • if you have some extra energy, could you quickly explain why when adding `:=` the following result changes so substantially: here it looks like I'm doing a cars leftjoin data_priceclass: `cars[data_priceclass, price_class := i.price_class, on = .(car_price >= price_from, car_price < price_to)]` , but here it looks like I'm doing a data_priceclass left join cars: `cars[data_priceclass, .(price_class), on = .(car_price >= price_from, car_price < price_to)] ` – Helen May 03 '21 at 13:06
  • 1
    This might help : `cars[data_priceclass, .(x.car_price,i.price_class), on = .(car_price >= price_from, car_price < price_to)]`. Modify by reference modifies the maching rows sequentially, so 190000 is first updated to 1, then 2, and 500000 is updated to 8 – Waldi May 03 '21 at 13:35
  • 1
    You can verify above by ordering price_class the other way round : `data_priceclass <- data_priceclass[order(-price_class)]`, which leads to price_class = 1 instead of 2 after `cars[data_priceclass, price_class := i.price_class, on = .(car_price >= price_from, car_price < price_to)]` – Waldi May 03 '21 at 13:44