How to keep join column unchanged in data.table non-equi join?

Question

I was trying to remove rows in a data.frame where the value in column posn was not in ranges given in another data.frame, with data.table's non-equi join feature.

Here is how my data looks like:

library(data.table)
df.cov <-
    structure(list(posn = c(1, 2, 3, 165, 1000), att = c("a", "b",
    "c", "d", "e")), .Names = c("posn", "att"), row.names = c(NA,
    -5L), class = "data.frame")
df.exons <-
    structure(list(start = c(2889, 2161, 277, 164, 1), end = c(3329,
    2826, 662, 662, 168)), .Names = c("start", "end"), row.names = c(NA,
    -5L), class = "data.frame")

setDT(df.cov)
setDT(df.exons)

df.cov
#    posn att
# 1:    1   a
# 2:    2   b
# 3:    3   c
# 4:  165   d
# 5: 1000   e
df.exons # ranges of `posn` to include
#    start  end
# 1:  2889 3329
# 2:  2161 2826
# 3:   277  662
# 4:   164  662
# 5:     1  168

Here is what I tried:

df.cov[df.exons, on = .(posn >= start, posn <= end), nomatch = 0]
#    posn att posn.1
# 1:  164   d    662
# 2:    1   a    168
# 3:    1   b    168
# 4:    1   c    168
# 5:    1   d    168

You can see that the posn column in df.cov is also changed. The expected result looks like this:

#    posn att
# 1:  165   d
# 2:    1   a
# 3:    2   b
# 4:    3   c
# 5   165   d
# the row order doesn't matter. I'll sort by posn latter.
# It is also fine if the duplicated rows are removed, otherwise I'll do this in next step.

How can I get the desired output with data.table non-equi join?

`df.cov[df.cov[df.exons, on = .(posn >= start, posn <= end), nomatch = 0, which=TRUE]]` I've filed a FR for this sort of subsetting https://github.com/Rdatatable/data.table/issues/2158 (not sure if it makes sense, though, since no one has replied to it yet). To drop the dupes, just apply `unique` to the `which` vector before subsetting. — Frank, Jun 10 '17 at 13:12
I asked a somewhat related question a while ago: [Use `j` to select the join column of `x` and all its non-join columns](https://stackoverflow.com/questions/42329157/use-j-to-select-the-join-column-of-x-and-all-its-non-join-columns) — Henrik, Jun 10 '17 at 13:20
@Frank, thanks. Here is what I finally come based on yours: `df.cov[sort(unique(df.cov[df.exons, on = .(posn >= start, posn <= end), nomatch = 0, which=TRUE]))]`. I read your FR and I think such a feature would really be helpful in many cases like this one. — mt1022, Jun 10 '17 at 13:25
@Henrik, I read the linked post. These two are very similar. I thought the `posn` in my results is from `X`, actually it is from `Y` in `X[Y]` syntax. — mt1022, Jun 10 '17 at 13:36
Possible related link https://stackoverflow.com/questions/42633033/between-vs-inrange-in-data-table — akrun, Jun 10 '17 at 14:04
@akrun, thanks for the helpful link discussing about the different between `between` and `inrange`. — mt1022, Jun 10 '17 at 14:08
Also [this](https://stackoverflow.com/questions/36454565/efficient-way-to-filter-one-data-frame-by-ranges-in-another/) — David Arenburg, Jun 11 '17 at 13:27

Jaap · Accepted Answer · 2022-05-17T11:28:16.657

9

You could also use %inrange%:

df.cov[posn %inrange% df.exons]

which results in:

   posn att
1:    1   a
2:    2   b
3:    3   c
4:  165   d

As you can see this leaves the values of the posn-column unchanged.

Another possiblity with a non-equi join:

df.exons[df.cov
         , on = .(start <= posn, end >= posn)
         , mult = "first"
         , nomatch = 0
         , .(posn = i.posn, att)][]

edited May 17 '22 at 11:28

answered Jun 10 '17 at 13:28

Jaap

81,064
34
182
193

1

So easy. I never notice this secret function before. – mt1022 Jun 10 '17 at 13:30
@mt1022 added another option – Jaap Jun 10 '17 at 13:54
1

I didn't expect you will spend time to find a solution that fits strictly to my request in the post. Thanks very much for the "non-equi join" approach. – mt1022 Jun 10 '17 at 14:15

How to keep join column unchanged in data.table non-equi join?

1 Answers1

Linked