3

I have some data that I am trying to filter in dplyr, but I can't seem to get the code right to get it done. Here are the two data sets:

df1 <- data.frame(Chromosome = c("chr1", "chr1", "chr2", "chr3", "chr4"),
                  Position = c(5 ,12, 20, 25, 50), stringsAsFactors = FALSE)

> df1
  Chromosome Position
1       chr1        5
2       chr1       12
3       chr2       20
4       chr3       25
5       chr4       50

df2 <- data.frame(Chromosome = c("chr1", "chr3"), From = c(1, 20),
                  To = c(10, 80),stringsAsFactors = FALSE)

> df2
  Chromosome From To
1       chr1    1 10
2       chr3   20 80

What I would like to do is select those rows from the first table where the chromosome numbers are identical between the tables and the position is contained between the "From" and the "To" in the second table. So the output here would be:

  Chromosome Position
1       chr1        5
2       chr3       25

Any suggestions on how to write this in R? In particular I'd love to use dplyr functions, but not required.

Sam Firke
  • 21,571
  • 9
  • 87
  • 105
Steven
  • 239
  • 2
  • 10
  • 3
    @DavidArenburg, I don't think so.. I think we're talking [this](http://stackoverflow.com/a/25655497/559784).. :-) – Arun Jan 20 '16 at 14:24

2 Answers2

3
library(dplyr)
left_join(df1, df2) %>%
  filter(Position > From, Position < To ) %>%
  select(-From, -To)

  Chromosome Position
1       chr1        5
2       chr3       25

This assumes no duplicated values of Chromosome in your data.frame with From and To.

Sam Firke
  • 21,571
  • 9
  • 87
  • 105
  • Thanks for the advice, everyone. There are definitely duplicated values of Chromosome, but if I use inner_join instead of left_join above it seems to work just fine. The size of the intermediate file is enormous. I'll also play around with some of the data.table "foverlap" functions on the referenced thread above. – Steven Jan 20 '16 at 15:45
0

If you are open to data.table one method would be

library(data.table)
setDT(df1, key="Chromosome")
setDT(df2, key="Chromosome")

df1[ df2,  { idx = i.From <= Position & Position <= i.To
              .(Position = Position[idx])
          },
     by=.EACHI
     ]

#   Chromosome Position
#1:       chr1        5
#2:       chr3       25

Inspired by this answer to a previous question of mine

Community
  • 1
  • 1
tospig
  • 7,762
  • 14
  • 40
  • 79