How to find overlapping regions between two data frames based on conditions

Question

I have two data frames, one called strain_1 and the other called strain_2. Each data frame has 4 columns (st_A, ed_A, st_B, ed_B : for "start" and "end" positions), but a different number of rows. st_A, ed_A and st_B, ed_B are the "start" and "end" positions of the block_A and block_B, respectively (see image 1 and the example below).

I am looking to identify the common overlapping blocks between strain_1 and strain_2.

Taking an example from image 1:

strain_1 <- data.frame(st_A=c(7,25,35,48,89), ed_A=c(9,28,38,51,91),
                       st_B=c(123,97,140,73, 13), ed_B=c(127,98,145,76,16))

strain_2 <- data.frame(st_A=c(5,20,36,49) ,  ed_A=c(8,25,39,50),    
                       st_B=c(124,95,141,105) ,  ed_B=c(129,100,147,110))

From this example, we see three overlapping regions (image 1):

The overlapping region is defined by : the min value of st_A (or st_B) and max value of ed_A (or ed_B) for block_A and block_B, respectively (see image 2: green box = common region).

The objective is to create a new data frame with these common regions (pair of blocks)

## result_desired 
result_desired <- data.frame(st_A=c(5,20,35), ed_A=c(9,28,39),
                             st_B=c(123,95,140), ed_B=c(129,100,147))

There are 16 possible combinations (see image 3), depending on the size of each block.

Is there a fast way to do this? knowing that I have data with several thousand rows.

I tried some code, based on @Gregor coments, but I can't get the desired result:

## require(dplyr)
require(dplyr)

## data 
strain_1 <- data.frame(st_A=c(7,25,35,48,89), ed_A=c(9,28,38,51,91),
                       st_B=c(123,97,140,73, 13), ed_B=c(127,98,145,76,16))

strain_2 <- data.frame(st_A=c(5,20,36,49) ,  ed_A=c(8,25,39,50),    
                       st_B=c(124,95,141,105) ,  ed_B=c(129,100,147,110))

# merge data to get cross join 
cj_data <-merge(strain_1,strain_2, by = NULL)

# Check block1 and block2
cj_filtered <- cj_data %>% mutate(c_block1= case_when(st_A.x <= st_A.y & ed_A.x <= ed_A.y | 
                                                      st_A.x >= st_A.y & ed_A.x >= ed_A.y |
                                                      st_A.x <= st_A.y & ed_A.x >= ed_A.y | 
                                                      st_A.x >= st_A.y & ed_A.x <= ed_A.y ~ "overlap_OK",
                                                      TRUE ~ "NO"),

                                  c_block2= case_when(st_B.x <= st_B.y & ed_B.x <= ed_B.y | 
                                                      st_B.x >= st_B.y & ed_B.x >= ed_B.y |
                                                      st_B.x <= st_B.y & ed_B.x >= ed_B.y | 
                                                      st_B.x >= st_B.y & ed_B.x <= ed_B.y ~ "overlap_OK",
                                                      TRUE ~ "NO"))

## cj_filtered:
st_A.x  ed_A.x  st_B.x  ed_B.x  st_A.y  ed_A.y  st_B.y  ed_B.y  c_block1    c_block2
7       9       123     127     5       8       124     129     overlap_OK  overlap_OK
25      28      97      98      5       8       124     129     overlap_OK  overlap_OK
35      38      140     145     5       8       124     129     overlap_OK  overlap_OK
48      51      73      76      5       8       124     129     overlap_OK  overlap_OK
89      91      13      16      5       8       124     129     overlap_OK  overlap_OK
7       9       123     127     20      25      95      100     overlap_OK  overlap_OK
25      28      97      98      20      25      95      100     overlap_OK  overlap_OK
35      38      140     145     20      25      95      100     overlap_OK  overlap_OK
48      51      73      76      20      25      95      100     overlap_OK  overlap_OK
89      91      13      16      20      25      95      100     overlap_OK  overlap_OK
7       9       123     127     36      39      141     147     overlap_OK  overlap_OK
25      28      97      98      36      39      141     147     overlap_OK  overlap_OK
35      38      140     145     36      39      141     147     overlap_OK  overlap_OK
48      51      73      76      36      39      141     147     overlap_OK  overlap_OK
89      91      13      16      36      39      141     147     overlap_OK  overlap_OK
7       9       123     127     49      50      105     110     overlap_OK  overlap_OK
25      28      97      98      49      50      105     110     overlap_OK  overlap_OK
35      38      140     145     49      50      105     110     overlap_OK  overlap_OK
48      51      73      76      49      50      105     110     overlap_OK  overlap_OK
89      91      13      16      49      50      105     110     overlap_OK  overlap_OK

Thanks for your help.

Do a full (cross) join, filter out rows with no overlap, and go from there. With several thousand rows in each data frame, you'll have several million rows in the joined data - still reasonable. If you run into memory issues, using the `data.table` package will help. — Gregor Thomas, Nov 07 '19 at 14:15
You also *don't* need to code each case separately. Once joined, use `pmin` and `pmax` to get the min/max st and ed for each _A and _B on each row. The overlap will always be from the `pmax` of the mins to the `pmin` of the maxs. If the `pmax` of the mins is greater than the `pmin` of the maxs, there is no overlap. — Gregor Thomas, Nov 07 '19 at 14:19
Thank you for your reply. In fact I am new to R, so I will try to find packages to start full(cross) join and then try to filter the data. — , Nov 07 '19 at 14:50
See the [R-FAQ on joins](https://stackoverflow.com/q/1299871/903061). If you use `merge(strain_1, strain_2, by = NULL)` you'll get a cross join. I'd recommend starting with a small subset of data to get things working. — Gregor Thomas, Nov 07 '19 at 14:59

score 0 · Accepted Answer · answered Nov 08 '19 at 02:33

Here are 2 options using data.table:

1a) Using non-equi joins:

cols <- c(paste0("x.", names(strain_1)), paste0("i.", names(strain_2)))
DT <- rbindlist(list(
    strain_1[strain_2, on=.(st_A>=st_A, st_A<=ed_A), nomatch=0L, mget(cols)],
    strain_1[strain_2, on=.(st_A<=st_A, ed_A>=st_A), nomatch=0L, mget(cols)]
))

1b) Using foverlaps:

setkey(strain_1, st_A, ed_A)
setkey(strain_2, st_A, ed_A)
foverlaps(strain_1, strain_2, nomatch=0L)

And then another step 2 to get desired output:

DT[between(x.st_B, i.st_B, i.ed_B) | between(i.st_B, x.st_B, x.ed_B), 
    .(st_A=pmin(x.st_A, i.st_A), 
        ed_A=pmax(x.ed_A, i.ed_A), 
        st_B=pmin(x.st_B, i.st_B), 
        ed_B=pmax(x.ed_B, i.ed_B))]

output:

   st_A ed_A st_B ed_B
1:    5    9  123  129
2:   20   28   95  100
3:   35   39  140  147

data:

library(data.table)
strain_1 <- data.frame(st_A=c(7,25,35,48,89), ed_A=c(9,28,38,51,91),
    st_B=c(123,97,140,73, 13), ed_B=c(127,98,145,76,16))

strain_2 <- data.frame(st_A=c(5,20,36,49) ,  ed_A=c(8,25,39,50),    
    st_B=c(124,95,141,105) ,  ed_B=c(129,100,147,110))

result_desired <- data.frame(st_A=c(5,20,35), ed_A=c(9,28,39),
    st_B=c(123,95,140), ed_B=c(129,100,147))

setDT(strain_1)
setDT(strain_2)
setDT(result_desired)

p.s.: There should be something in Bioconducter with IRanges as well.

Thank you for help. In fact I get an error message whene I run the step 2: ERROR: Expecting a single value: [extent=4]. Do you have an idea where comes from? — , Nov 08 '19 at 08:33
It works very well. Thank you very much. In fact, it was a problem with R version — , Nov 08 '19 at 09:14
Perfect!, thanks. To avoid warning message I aded : `allow.cartesian=TRUE` to the script : `strain_1[strain_2, on=.(st_A<=st_A, ed_A>=st_A), nomatch=0L, mget(cols), allow.cartesian=TRUE]` — , Nov 08 '19 at 14:11

How to find overlapping regions between two data frames based on conditions

1 Answers1