0

I have two data frames:

CHR  POS
10   289968
10   580270

CHR  START  STOP
10   250000 300000
10   700422 700500

Search #1

> subset(df1, CHR==df2$CHR & POS >= df2$START & POS <= df2$STOP)
    CHR POS
  1 10  289968

But if I flip the order of rows in df2, then the search does not work. For example, df2 is now like this

CHR  START  STOP
10   700422 700500
10   250000 300000

    > subset(df1, CHR==df2$CHR & POS >= df2$START & POS <= df2$STOP)
    [1] CHR POS
    <0 rows> (or 0-length row.names)

Why does the order matter here?

smci
  • 32,567
  • 20
  • 113
  • 146
cryptic0
  • 225
  • 2
  • 12
  • It is because the condition `POS >= df2$START` (or `CHR==df2$CHR` or `POS <= df2$STOP`) compares two vectors **element-by-element**. The logical expression exactly returns a vector indicating which rows to keep for the second argument of `subset`. – raymkchow Jan 25 '17 at 03:53
  • 1
    I don't think I understand. Can you elaborate a bit? – cryptic0 Jan 25 '17 at 03:55
  • `df1$CHR==df2$CHR` returns `[1] TRUE TRUE` because the first and second elements of `df1$CHR` (`10`) are the same as `df2$CHR`. The order matters because `df1$POS >= df2$START` and `df1$POS <= df2$STOP` returns `[1] FALSE TRUE` and `[1] TRUE FALSE` after you flip the order of the row, instead of `[1] TRUE FALSE` and `[1] TRUE TRUE` before flipping. – raymkchow Jan 25 '17 at 04:04
  • So in other words the search only works if the conditions are all satisfied on the exact same row in both data frames? That's too much to ask for. Perhaps there is a more efficient way of doing this that I am not aware of? – cryptic0 Jan 25 '17 at 04:15
  • You are right. Conditions are evaluated on the exact same row. `subset` and `[` are the only two methods I used for subsetting a `data.frame`. You can read about [this question](http://stackoverflow.com/questions/27303534/faster-way-to-subset-on-rows-of-a-data-frame-in-r) for subsetting a `data.table`. I personally don't use data.table but it should be faster. – raymkchow Jan 25 '17 at 04:27
  • What about this solution? It will only show me TRUE/FALSE, but does it at least work without the constraints experienced by the above example? `df1$CHR == df2$CHR & df1$POS >= df2$START & df1$POS <= df2$STOP` – cryptic0 Jan 25 '17 at 04:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/133936/discussion-between-raymkchow-and-cryptic0). – raymkchow Jan 25 '17 at 04:46

1 Answers1

1

For not leaving blank for the questions, I just post the answer.

For each row of df1 you need to check if any of rows of df2 satisfy the given condition START <= POS <= STOP and subset df1 based on the condition.

The steps are the following.
1. Use sapply with a checking function for every rows of df1.
2. subsetting with subset

subset(df1, sapply(1:nrow(df1), 
       function(i) any(df1$CHR[i] == df2$CHR & df1$POS[i] >= df2$START & df1$POS[i] <= df2$STOP)))
raymkchow
  • 929
  • 11
  • 20