I have a huge data frame that looks like the one below.
I want to group_by(chr) and then, for each chr, find out:
- Does any range1 (start1, end1) within the chr group overlap with any range2 (start2, end2)?
library(dplyr)
df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 200, 500, 1000),
              end2 = c(250, 240, 1000, 2000)
)
df1
#> # A tibble: 4 × 6
#>     chr start1  end1 species start2  end2
#>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl>
#> 1     1    100   150 Penguin    200   250
#> 2     1    200   400 Penguin    200   240
#> 3     2    100   150 Penguin    500  1000
#> 4     2    200   400 Penguin   1000  2000
Created on 2023-01-05 with reprex v2.0.2
I want my data to look like this. Essentially, I want to check whether each range2 overlaps with any range1 in the same chr group.
# A tibble: 4 × 7
    chr start1  end1 species start2  end2 OVERLAP
      1    100   150 Penguin    200   250 TRUE
      1    200   400 Penguin    200   240 TRUE
      2    100   150 Penguin    500  1000 FALSE
      2    200   400 Penguin   1000  2000 FALSE
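To make the intended check concrete, here is a minimal sketch of it using only dplyr and base R. The helper overlaps_any() is a hypothetical name (it is not part of ivs or dplyr), and it assumes half-open ranges [start, end), so two ranges that only touch at an endpoint do not count as overlapping.

```r
library(dplyr)

df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 200, 500, 1000),
              end2 = c(250, 240, 1000, 2000))

# Hypothetical helper: for each range2, is there at least one range1
# (among the vectors passed in) that overlaps it?
# Assumes half-open ranges [start, end).
overlaps_any <- function(start1, end1, start2, end2) {
  vapply(seq_along(start2), function(i) {
    any(start1 < end2[i] & start2[i] < end1)
  }, logical(1))
}

# Run the check separately inside each chr group.
result <- df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = overlaps_any(start1, end1, start2, end2)) %>%
  ungroup()

result$OVERLAP
```

On the toy data this reproduces the OVERLAP column shown above. It compares every range2 with every range1 in the group, so it is only meant to pin down the logic, not to be fast on a huge data frame.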
I have fought a lot with the ivs package and iv_overlaps(), with no success in getting what I want.
Major EDIT:
When I apply any of the code to my real data, I do not get the results I want, and I am confused about why. The new dataset does not change the question; it is only there to test the code.
data <- tibble::tribble(
~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
"Chr2", 2739, 2840, "+", "A", 740, 1739,
"Chr2", 12577, 12678, "+", "B", 10578, 11577,
"Chr2", 22431, 22532, "+", "C", 20432, 21431,
"Chr2", 32202, 32303, "+", "D", 30203, 31202,
"Chr2", 42024, 42125, "+", "E", 40025, 41024,
"Chr2", 51830, 51931, "+", "F", 49831, 50830,
"Chr2", 82061, 84742, "+", "G", 80062, 81061,
"Chr2", 84811, 86692, "+", "H", 82812, 83811,
"Chr2", 86782, 88106, "-", "I", 88107, 89106,
"Chr2", 139454, 139555, "+", "J", 137455, 138454,
)
library(dplyr)
library(ivs)

data %>%
  group_by(chr) %>%
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))
This gives the following output:
   chr   start1   end1 strand gene  start2   end2 overlap
   <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl>  <dbl> <lgl>
 1 Chr2    2739   2840 +      A        740   1739 TRUE
 2 Chr2   12577  12678 +      B      10578  11577 TRUE
 3 Chr2   22431  22532 +      C      20432  21431 TRUE
 4 Chr2   32202  32303 +      D      30203  31202 TRUE
 5 Chr2   42024  42125 +      E      40025  41024 TRUE
 6 Chr2   51830  51931 +      F      49831  50830 TRUE
 7 Chr2   82061  84742 +      G      80062  81061 TRUE
 8 Chr2   84811  86692 +      H      82812  83811 TRUE
 9 Chr2   86782  88106 -      I      88107  89106 TRUE
10 Chr2  139454 139555 +      J     137455 138454 TRUE
This is wrong. These might be indirect matches, but there is no direct overlap.
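A possible explanation, offered as a sketch rather than a confirmed diagnosis: iv_overlaps() returns one logical value per needle, and wrapping it in any() collapses those values into a single one per chr group, so one overlapping pair anywhere in the group marks every row TRUE. The example below uses three rows (genes G, H and I) copied from the data above, and it assumes the goal is a per-row flag telling whether that row's range2 overlaps any range1 in the group.

```r
library(dplyr)
library(ivs)

# Three rows copied from the real data above (strand column dropped).
small <- tibble::tribble(
  ~chr,   ~start1, ~end1, ~gene, ~start2, ~end2,
  "Chr2",   82061, 84742,   "G",   80062, 81061,
  "Chr2",   84811, 86692,   "H",   82812, 83811,
  "Chr2",   86782, 88106,   "I",   88107, 89106
)

# With any(): a single value per group, recycled to every row.
collapsed <- small %>%
  group_by(chr) %>%
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2)))) %>%
  ungroup()

# Without any(): one value per row.
# Assumption: range2 is the needle and range1 is the haystack.
per_row <- small %>%
  group_by(chr) %>%
  mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1))) %>%
  ungroup()

collapsed$overlap
per_row$overlap
```

In this subset, only the range2 of gene H (82812 to 83811) falls inside a range1, namely that of gene G (82061 to 84742). The any() version therefore flags all three rows, while the per-row version flags only gene H. Whether the per-row flag is the right definition of a "direct" overlap is an assumption on my part.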