Searching for a value from one dataframe column in a second dataframe of intervals

Question

I have two data frames:

CHR  POS
10   289968
10   580270

CHR  START  STOP
10   250000 300000
10   700422 700500

Search #1

> subset(df1, CHR==df2$CHR & POS >= df2$START & POS <= df2$STOP)
    CHR POS
  1 10  289968

But if I flip the order of rows in df2, then the search does not work. For example, df2 is now like this

CHR  START  STOP
10   700422 700500
10   250000 300000

    > subset(df1, CHR==df2$CHR & POS >= df2$START & POS <= df2$STOP)
    [1] CHR POS
    <0 rows> (or 0-length row.names)

Why does the order matter here?

It is because the condition `POS >= df2$START` (or `CHR==df2$CHR` or `POS <= df2$STOP`) compares two vectors **element-by-element**. The logical expression exactly returns a vector indicating which rows to keep for the second argument of `subset`. — raymkchow, Jan 25 '17 at 03:53
`df1$CHR==df2$CHR` returns `[1] TRUE TRUE` because the first and second elements of `df1$CHR` (`10`) are the same as `df2$CHR`. The order matters because `df1$POS >= df2$START` and `df1$POS <= df2$STOP` returns `[1] FALSE TRUE` and `[1] TRUE FALSE` after you flip the order of the row, instead of `[1] TRUE FALSE` and `[1] TRUE TRUE` before flipping. — raymkchow, Jan 25 '17 at 04:04
So in other words the search only works if the conditions are all satisfied on the exact same row in both data frames? That's too much to ask for. Perhaps there is a more efficient way of doing this that I am not aware of? — cryptic0, Jan 25 '17 at 04:15
You are right. Conditions are evaluated on the exact same row. `subset` and `[` are the only two methods I used for subsetting a `data.frame`. You can read about [this question](http://stackoverflow.com/questions/27303534/faster-way-to-subset-on-rows-of-a-data-frame-in-r) for subsetting a `data.table`. I personally don't use data.table but it should be faster. — raymkchow, Jan 25 '17 at 04:27
What about this solution? It will only show me TRUE/FALSE, but does it at least work without the constraints experienced by the above example? `df1$CHR == df2$CHR & df1$POS >= df2$START & df1$POS <= df2$STOP` — cryptic0, Jan 25 '17 at 04:33
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/133936/discussion-between-raymkchow-and-cryptic0). — raymkchow, Jan 25 '17 at 04:46

score 1 · Accepted Answer · answered Jan 25 '17 at 05:31

For not leaving blank for the questions, I just post the answer.

For each row of df1 you need to check if any of rows of df2 satisfy the given condition START <= POS <= STOP and subset df1 based on the condition.

The steps are the following.
1. Use sapply with a checking function for every rows of df1.
2. subsetting with subset

subset(df1, sapply(1:nrow(df1), 
       function(i) any(df1$CHR[i] == df2$CHR & df1$POS[i] >= df2$START & df1$POS[i] <= df2$STOP)))

Searching for a value from one dataframe column in a second dataframe of intervals

1 Answers1