Inner join data.frames using multiple arguments in R dplyr

Question

I have two data frames that look like this

library(dplyr)

#> Attaching package: 'dplyr'


df1 <- tibble(chrom=c(1,1,1,2,2,2),
              start=c(100,200,300,100,200,300),
              end=c(150,250,350,120,220,320))

df2 <- tibble(chrom=c(1,1,1,2,2,2),
              start2=c(100,50,280,100,10,200),
              end2=c(125,100,320,115,15,350))

df1 
#> # A tibble: 6 × 3
#>   chrom start   end
#>   <dbl> <dbl> <dbl>
#> 1     1   100   150
#> 2     1   200   250
#> 3     1   300   350
#> 4     2   100   120
#> 5     2   200   220
#> 6     2   300   320
df2
#> # A tibble: 6 × 3
#>   chrom start2  end2
#>   <dbl>  <dbl> <dbl>
#> 1     1    100   125
#> 2     1     50   100
#> 3     1    280   320
#> 4     2    100   115
#> 5     2     10    15
#> 6     2    200   350

^{Created on 2023-01-09 with reprex v2.0.2}

When I am trying to join based on the following conditions I am taking a huge error any ideas why

inner_join(df2, df1, by = join_by(chrom, start< end2, end > start2))

The error

Error in `inner_join()`:
! Join columns in `x` must be present in the data.
✖ Problem with `start` and `end`.
Run `rlang::last_error()` to see where the error occurred.

This is the expected outcome

  chrom start end start2 end2
1:     1   100 150    100  125
2:     1    NA  NA     50  100
3:     1   300 350    280  320
4:     2   100 120    100  115
5:     2    NA  NA     10   15
6:     2   200 220    200  350
7:     2   300 320    200  350

r2evans · Accepted Answer · 2023-01-09T13:50:43.687

2

You list df2 first in the inner_join, its variables need to be listed on the LHS of the comparisons.

You can either swap df1/df2 or swap the order of the comparison variables (effectively the same given an inner join):

inner_join(df2, df1, by = join_by(chrom, end2 > start, start2 < end))
# # A tibble: 5 × 5
#   chrom start2  end2 start   end
#   <dbl>  <dbl> <dbl> <dbl> <dbl>
# 1     1    100   125   100   150
# 2     1    280   320   300   350
# 3     2    100   115   100   120
# 4     2    200   350   200   220
# 5     2    200   350   300   320

inner_join(df1, df2, by = join_by(chrom, start < end2, end > start2))
# # A tibble: 5 × 5
#   chrom start   end start2  end2
#   <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1     1   100   150    100   125
# 2     1   300   350    280   320
# 3     2   100   120    100   115
# 4     2   200   220    200   350
# 5     2   300   320    200   350

Your expected output suggests a sided-join, where these two are effectively equivalent:

left_join(df2, df1, by = join_by(chrom, end2 > start, start2 < end))
# # A tibble: 7 × 5
#   chrom start2  end2 start   end
#   <dbl>  <dbl> <dbl> <dbl> <dbl>
# 1     1    100   125   100   150
# 2     1     50   100    NA    NA
# 3     1    280   320   300   350
# 4     2    100   115   100   120
# 5     2     10    15    NA    NA
# 6     2    200   350   200   220
# 7     2    200   350   300   320
right_join(df1, df2, by = join_by(chrom, start < end2, end > start2))

edited Jan 09 '23 at 13:50

answered Jan 09 '23 at 13:34

r2evans

141,215
6
77
149

Indeed but these are NAs because there is no overlap. I think it does not matter. Although statistic wise might be interesting to some, which does not overlap. For me it's ok. I accept the answer – LDT Jan 09 '23 at 13:38
@LDT, to get the `NA` rows, you'd need a sided-join, perhaps `right_join(df1, df2, by = join_by(chrom, start < end2, end > start2))` (or the left-equivalent). If you aren't fully comfortable with left/right/inner, I suggest reading https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/q/5706437/3358272. Glad it worked! – r2evans Jan 09 '23 at 13:42
If you prefer having `df2` first, you can also manually specify whether the columns come from `x` or `y` with `x$` and `y$` like `join_by(chrom, y$start< x$end2, y$end > x$start2)` – Davis Vaughan Jan 09 '23 at 14:52
@DavisVaughan, is that a recommendation in the (dev) dplyr docs? I thought the use of `$` anywhere in the realm of dplyr's NSE is discouraged (there are always corner-case exceptions, I don't know that this is one of them). – r2evans Jan 09 '23 at 14:58
1

It is mentioned here https://dplyr.tidyverse.org/dev/reference/join_by.html#column-referencing. `join_by()` uses a special DSL that doesn't really have an equivalent anywhere else in the tidyverse so it is a little hard to compare. This is mainly for convenience to make some edge cases a little easier to read, I don't expect it to be used much. – Davis Vaughan Jan 09 '23 at 15:33

score 1 · Answer 2 · answered Jan 09 '23 at 13:35

1

Is this what you're looking for?

df3 <- inner_join(df2, df1, by = "chrom") %>%
filter(start< end2, end > start2)

answered Jan 09 '23 at 13:35

alexandradevon

46
5

I don't think this is what the OP was looking for, for two reasons: (1) the presence of `join_by` suggests the use of the dev version of `dplyr` (soon to be released as 1.1.0), which not only adds non-equi joins as a native join operation, it also does all joins much faster; and (2) This starts by doing a cartesian expansion that may be acceptable with this toy data but can be completely unusable in some datasets. (With the dev version of dplyr, this actually warns about the expansion. It still works, but I think the warning is a good thing imho.) – r2evans Jan 09 '23 at 13:38
1

Ahh, yes, thanks for this, my datasets are gigantic so this will be very helpful moving forward! – alexandradevon Jan 09 '23 at 13:44

Inner join data.frames using multiple arguments in R dplyr

2 Answers2