How to match rows within a range of another dataset

Question

I have a genetic dataset in which I need to match chromosome positions from one file to the chromosome position ranges given in another file, keeping a position only if it falls within a range.

I have tried the answers to similar questions, mostly about time intervals, but they haven't worked for me because I also need the chromosome number to match (so that I don't match identical positions on different chromosomes).

My data looks like this:

#df1 - chromosome positions to find within df2 ranges:

Chromosome   Position   Start   End
    1           101      101    101
    2           101      101    101
    3           600      600    600

#df2 - genomic ranges
Chromosome Start End      CpG
    1       50   200       10
    1       300  400        2
    4       100  200        5

Expected matched output (also ultimately I am looking to find the matching CpG column for df1 data):

Chromosome   Position    Start   End   CpG
    1           101        50    200    10  #only row of df1 that's within a range on df2 on the same chromosome

I am currently trying to do this with:

df <-df1 %>%
  left_join(df2, 
            by = "Chromosome") %>% 
  filter(Position >= Start & Position <= End)

Error: Problem with `filter()` input `..1`.
x object 'Start' not found
i Input `..1` is `Position >= Start & Position <= End`.

I don't understand why I am getting this error: the Start and End columns exist in both files and are all integers. Is there something I'm missing, or another way I can solve this?

My actual data is quite large, so I would also welcome a data.table solution. I've tried, but I'm a novice and haven't got far:

df1[df2, on = .(Chromosome, Position > End, Position < Start ) ]

Edit: trying with foverlaps:

setkey(df1)
df2[, End := Start]
foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all", nomatch = 0L)

Error in foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all",  : 
  length(by.x) != length(by.y). Columns specified in by.x should correspond to columns specified in by.y and should be of same lengths.

[Overlap join with start and end positions](https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions) — Henrik, Jul 17 '20 at 10:20
Thank you for this, I've had a go but it gives me an error I don't know how to address, I've edited it into my question — DN1, Jul 17 '20 at 10:26
You might need to read down further in that link. See answer below. — DaveTurek, Jul 19 '20 at 18:19

DaveTurek · Accepted Answer · 2020-07-19T18:27:17.883

For a data.table solution, look at the second answer (by Arun, on non-equi joins) in the link provided by @Henrik: [Overlap join with start and end positions](https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions)

Based on that, we have

library(data.table)

df1 <- data.table(Chromosome=1:3,Position=c(101,101,600),
                  Start=c(101,101,600),End=c(101,101,600))

df2 <- data.table(Chromosome=c(1,1,4),
                  Start=c(50,300,100),End=c(200,400,200),CpG=c(10,2,5))

df1[df2,.(Chromosome,Position=x.Position,Start,End,CpG),
    on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]

giving

   Chromosome Position Start End CpG
1:          1      101   101 101  10

That's not quite right because it takes Start and End from df1 rather than df2. Why do you even have Start and End in df1?

One way to deal with that is to not include them in the join statement:

df1[,.(Chromosome,Position)][df2,
    .(Chromosome,Position=x.Position,Start,End,CpG),
   on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]

giving

   Chromosome Position Start End CpG
1:          1      101    50 200  10

[EDIT to note that @Carles Sans Fuentes identified the same issue in his dplyr answer.]
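
An alternative to dropping the columns is to keep df1 intact and state explicitly which table each column comes from. In the j expression of a data.table join, the x. prefix refers to the outer table (df1 here) and the i. prefix to the table inside the brackets (df2 here). A minimal sketch on the same example data, with integer columns throughout (my assumption, to avoid mixing integer and double join keys):

```r
library(data.table)

df1 <- data.table(Chromosome = 1:3,
                  Position   = c(101L, 101L, 600L),
                  Start      = c(101L, 101L, 600L),
                  End        = c(101L, 101L, 600L))

df2 <- data.table(Chromosome = c(1L, 1L, 4L),
                  Start      = c(50L, 300L, 100L),
                  End        = c(200L, 400L, 200L),
                  CpG        = c(10L, 2L, 5L))

# Non-equi join: same chromosome, and df1's Position inside df2's [Start, End].
# x.* picks columns from df1, i.* picks columns from df2.
res <- df1[df2,
           .(Chromosome, Position = x.Position,
             Start = i.Start, End = i.End, CpG),
           on = .(Chromosome, Position >= Start, Position <= End),
           nomatch = 0L]
res
```

This should return the single matching row with the range (50 to 200) and CpG value taken from df2, without having to modify df1 first.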

As a check on cases with more matches, I added some more data:

df1 <- data.table(Chromosome=c(1,1:4),Position=c(350,101,101,600,200),
                  Start=c(350,101,101,600,200),End=c(350,101,101,600,200))

df1
   Chromosome Position Start End
1:          1      350   350 350
2:          1      101   101 101
3:          2      101   101 101
4:          3      600   600 600
5:          4      200   200 200

df1[,.(Chromosome,Position)][df2,
    .(Chromosome,Position=x.Position,Start,End,CpG),
    on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]

   Chromosome Position Start End CpG
1:          1      101    50 200  10
2:          1      350   300 400   2
3:          4      200   100 200   5

which I assume is what you want.
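
Since the question also tried foverlaps(), here is a sketch of how that route is usually set up, not a diagnosis of the error in the question. It assumes integer columns, treats each position as a zero-width interval (Start == End == Position), and keys the range table so that its interval columns come last. The name hits is just illustrative:

```r
library(data.table)

# Points to look up; Start == End == Position makes each a zero-width interval.
df1 <- data.table(Chromosome = c(1L, 1L, 2L, 3L, 4L),
                  Position   = c(350L, 101L, 101L, 600L, 200L))
df1[, `:=`(Start = Position, End = Position)]

df2 <- data.table(Chromosome = c(1L, 1L, 4L),
                  Start      = c(50L, 300L, 100L),
                  End        = c(200L, 400L, 200L),
                  CpG        = c(10L, 2L, 5L))

# foverlaps() requires the second table to be keyed,
# with the interval start and end as the last two key columns.
setkey(df2, Chromosome, Start, End)

# by.x and by.y must have the same length and line up column by column.
hits <- foverlaps(df1, df2,
                  by.x = c("Chromosome", "Start", "End"),
                  by.y = key(df2),
                  type = "within", nomatch = 0L)

# Start and End in the result are df2's range; df1's copies get an i. prefix.
hits[, .(Chromosome, Position, Start, End, CpG)]
```

On this data it should find the same three matches as the non-equi join. Which of the two is faster on large data is worth benchmarking on your own files.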

score 0 · Answer 2 · edited Apr 02 '21 at 06:28

The problem is caused by left_join(). Both datasets have Start and End columns, and since a data frame cannot hold two columns with the same name, the join renames them to Start.x and End.x (from df1) and Start.y and End.y (from df2). After the join there is no column called Start, which is why filter() cannot find it.
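
A quick way to see the renaming is to inspect the column names right after the join. A small sketch on the example data from the question, built here with data.frame() and integer columns for brevity (the name joined is just illustrative):

```r
library(dplyr)

df1 <- data.frame(Chromosome = 1:3,
                  Position   = c(101L, 101L, 600L),
                  Start      = c(101L, 101L, 600L),
                  End        = c(101L, 101L, 600L))

df2 <- data.frame(Chromosome = c(1L, 1L, 4L),
                  Start      = c(50L, 300L, 100L),
                  End        = c(200L, 400L, 200L),
                  CpG        = c(10L, 2L, 5L))

joined <- left_join(df1, df2, by = "Chromosome")

# The clashing columns carry .x (df1) and .y (df2) suffixes;
# there is no plain Start or End left for filter() to use.
names(joined)
# "Chromosome" "Position" "Start.x" "End.x" "Start.y" "End.y" "CpG"
```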

Therefore, you must either remove the Start and End columns from the first dataset as:

library(data.table)
library(dplyr)

df1 <- fread("Chromosome   Position   Start   End
              1           101      101    101
              2           101      101    101
              3           600      600    600")
df2 <- fread("Chromosome Start End      CpG
              1       50   200       10
              1       300  400        2
              4       100  200        5")

df <- df1 %>%
  select(Chromosome, Position) %>%
  left_join(df2, by = "Chromosome") %>%
  filter(Position >= Start & Position <= End)

or refer to the suffixed column names, then drop the extra columns afterwards if you don't need them:

df <-df1 %>%
  left_join(df2, 
            by = "Chromosome") %>% 
  filter(Position >= Start.y & Position <= End.y)
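
Both variants above first pair every position with every range on the same chromosome and only then filter, which can get expensive on large data. As a hedged alternative, newer dplyr releases can express the range condition inside the join itself. The sketch below assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available, and uses inner_join() so that unmatched positions are dropped. The name matched is just illustrative:

```r
library(dplyr)  # join_by() assumes dplyr >= 1.1.0

df1 <- data.frame(Chromosome = 1:3,
                  Position   = c(101L, 101L, 600L),
                  Start      = c(101L, 101L, 600L),
                  End        = c(101L, 101L, 600L))

df2 <- data.frame(Chromosome = c(1L, 1L, 4L),
                  Start      = c(50L, 300L, 100L),
                  End        = c(200L, 400L, 200L),
                  CpG        = c(10L, 2L, 5L))

# Match on equal Chromosome AND Start <= Position <= End (bounds inclusive).
# Dropping df1's own Start/End first avoids the .x/.y name clash.
matched <- df1 %>%
  select(Chromosome, Position) %>%
  inner_join(df2, by = join_by(Chromosome, between(Position, Start, End)))

matched
```

On the example data this should keep only the chromosome 1 position at 101, paired with the 50 to 200 range and its CpG value of 10.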

Cheers !
