I have a large data frame that looks like this.
I want to group_by seqnames and, within each group, check whether any of the ranges defined by start and end overlap.
If ranges overlap, only the row with the highest score should be kept.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(
  seqnames = rep(c("Chr1", "Chr2"), each = 3),
  start    = c(100, 200, 300, 100, 200, 300),
  end      = c(150, 400, 500, 120, 220, 320),
  score    = c(1000, 500, 1000, 1000, 1000, 1000)
)
df
#> # A tibble: 6 × 4
#>   seqnames start   end score
#>   <chr>    <dbl> <dbl> <dbl>
#> 1 Chr1       100   150  1000
#> 2 Chr1       200   400   500
#> 3 Chr1       300   500  1000
#> 4 Chr2       100   120  1000
#> 5 Chr2       200   220  1000
#> 6 Chr2       300   320  1000
Created on 2022-12-27 with reprex v2.0.2
The desired output is:
seqnames start   end score
<chr>    <dbl> <dbl> <dbl>
Chr1       100   150  1000
Chr1       300   500  1000
Chr2       100   120  1000
Chr2       200   220  1000
Chr2       300   320  1000
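
To make the rule concrete, here is a minimal sketch of one way it could be read. It is not a definitive solution and it rests on a few assumptions that may not match what is actually wanted: ranges that merely touch (start equal to an earlier end) count as overlapping, a chain of overlapping ranges is treated as a single cluster from which one row survives, and ties on score are broken by keeping the first row. The `cluster` column is a hypothetical helper introduced only for illustration.

```r
library(dplyr)

df <- tibble(
  seqnames = rep(c("Chr1", "Chr2"), each = 3),
  start    = c(100, 200, 300, 100, 200, 300),
  end      = c(150, 400, 500, 120, 220, 320),
  score    = c(1000, 500, 1000, 1000, 1000, 1000)
)

result <- df %>%
  group_by(seqnames) %>%
  arrange(start, .by_group = TRUE) %>%
  # Assumption: a new cluster begins when a range starts after the
  # largest end seen so far within the same seqnames group.
  mutate(cluster = cumsum(start > lag(cummax(end), default = first(end)))) %>%
  # Keep one row per cluster: the one with the highest score.
  group_by(seqnames, cluster) %>%
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-cluster)

result
```

On the example data this keeps rows 1 and 3 of Chr1 and all three rows of Chr2, which matches the desired output shown above.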