0

I have a data.frame1:

BIN     CHR     BP1             BP2             Score   Value
12      chr1    29123222        29454711        -5.7648 599
116     chr13    45799118        45986770        -4.8403 473
117     chr5    46327104        46490961        -5.3036 536
121     chr6    50780759        51008404        -4.4165 415
133     chr18    63634657        63864734        -4.8096 469
147     chr1   77825305        78062178        -5.4671 559

I have a second data.frame2 like:

    CHR     SNP             A1      A2      BP        INFO    OR      SE      P      NGT   
    chr1    rs10900604      A       G       29204555  0.774   1.01582 0.0143  0.2723  0
    chr3    rs12132517      A       G       79880711  0.604   0.98334 0.0253  0.5071  2
    chr14    rs11240777      A       G       79895429  0.818   0.98817 0.0139  0.3907  27
    chr18    rs147634896     T       C       63789900  0.623   1.02634 0.0259  0.3161  0
    chr6    rs143609865     A       T       77934001  0.617   1.01562 0.0317  0.6254  0

I am interested in keeping all the rows in data.frame2 that match the following crirteria: they much have the same CHR and a BP value between BP1 and BP2 of any of the rows in data.frame1.

For example, row one of data.frame2 has "chr1" and also has a "BP" between one of the BP ranges from data.frame1. Notice it doesn't fall into row 7's range, but it does fall into row 1's range. I would thus like to keep this row in data.frame2

Another example, row 4 of data.frame2 has "chr18" and BP 63789900 that falls within the BP range (between BP1 and BP2) of row 5 in data.frame1. I would thus like to keep this row in data.frame2

Final example. Notice that row 5 in data.frame2 has a BP 77934001 that falls within BP1 and BP2 range for row 6 in data.frame1. Yet in data.frame2 "chr6" does not match with "chr1". I would like to delete this row.

I would also like to delete all the other rows that don't match both CHR and BP range at the same time.

I was thinking maybe if loop that had CHR1=CHR2, and BP>BP1 and BP

Evan
  • 1,477
  • 1
  • 17
  • 34

2 Answers2

4

This ought to work using base R:

# merge the relevant data
dfmerge = merge(df1[c("CHR", "BP1", "BP2")], df2, by = "CHR")

# delete unwanted rows
dfmerge = dfmerge[(dfmerge$BP > dfmerge$BP1 & dfmerge$BP < dfmerge$BP2),]

# clean up columns
dfmerge[c("BP1", "BP2")] = list(NULL)

In generally, SQL can do this nice and concisely:

library(sqldf)

sqldf("select df2.*
       from df2 inner join df1
       on df2.CHR = df1.CHR
       and df2.BP between df1.BP1 and df2.BP2")
Evan
  • 1,477
  • 1
  • 17
  • 34
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Does this account for the fact that there will be multiple "chr1" in both files. What happens when you merge with non-unique values? – Evan Dec 04 '15 at 02:02
  • 1
    It does. Joins will match all possible pairs. The `merge` default is an inner join that will only keep records that have a match. Try it yourself and see! – Gregor Thomas Dec 04 '15 at 03:31
  • I got this error `"Error in `[.data.frame`(dfmerge, dfmerge$BP > dfmerge$BP1 & dfmerge$BP < : undefined columns selected"` after the #delete rows line – Evan Dec 04 '15 at 05:45
  • I added paranthesis around the statement, and put in a comma to make it `[dfmerge$BP > dfmerge$BP1 & dfmerge$BP < dfmerge$BP2),] and it worked! – Evan Dec 04 '15 at 05:49
  • 1
    @Evan thanks for the edits! In the future, if you share data with `dput()` in your question it is copy/paste-able and you are much more likely to get tested solutions ;) – Gregor Thomas Dec 04 '15 at 07:24
  • Thanks so much! That will help with my harder stuck I'm working on! Upvoted – Evan Dec 04 '15 at 07:39
3

Here is a dplyr approach:

d %>%
  left_join(d2) %>%
  filter(BP >= BP1 & BP <= BP2)
Jacob H
  • 4,317
  • 2
  • 32
  • 39
  • Does this make sure that the "CHR" values match as well? – Evan Dec 04 '15 at 02:02
  • 1
    @Evan yep, the call to `left_join` joins the data sets on the `CHR` variables, since it is the only shared variable. To see, try running `d %>%; left_join(d2)`, this will spit out `Joining by: "CHR"`. You can make this more explicate with `left_join(d2, by = CHR)` – Jacob H Dec 04 '15 at 02:08