I have a large data frame that looks like the one below. I want to find which genes overlap with other genes, based on their start and end positions.
library(tidyverse)
data <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                   genes = c("A", "B", "C", "D", "E", "F"),
                   start = c(1000, 2000, 3000, 800, 400, 2000),
                   end   = c(1500, 2500, 3500, 1200, 500, 10000))
data
#>   group genes start   end
#> 1     1     A  1000  1500
#> 2     1     B  2000  2500
#> 3     1     C  3000  3500
#> 4     2     D   800  1200
#> 5     2     E   400   500
#> 6     2     F  2000 10000
Created on 2022-12-05 with reprex v2.0.2
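I want a result like this, where a new match column lists each gene together with the genes it overlaps (NA if there are none):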
data
#>   group genes start   end match
#> 1     1     A  1000  1500   A-D
#> 2     1     B  2000  2500   B-F
#> 3     1     C  3000  3500   C-F
#> 4     2     D   800  1200   A-D
#> 5     2     E   400   500    NA
#> 6     2     F  2000 10000 F-C-B
I am a bit lost on how to start. Any help is appreciated.
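One possible starting point, as a minimal base R sketch rather than a definitive solution. It assumes that two genes "match" when their closed intervals [start, end] share at least one position, and that every gene is compared against every other gene regardless of group (the desired output above is consistent with this, but it is an assumption). The helper matrix name overlaps is made up for illustration.

```r
data <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                   genes = c("A", "B", "C", "D", "E", "F"),
                   start = c(1000, 2000, 3000, 800, 400, 2000),
                   end   = c(1500, 2500, 3500, 1200, 500, 10000),
                   stringsAsFactors = FALSE)

# Assumption: intervals i and j overlap when start_i <= end_j and end_i >= start_j.
# outer() builds the full n x n comparison, so overlaps[i, j] is TRUE if gene i
# overlaps gene j.
overlaps <- outer(data$start, data$end, "<=") & outer(data$end, data$start, ">=")
diag(overlaps) <- FALSE  # a gene should not be reported as matching itself

# For each gene, join its own name with the names of its overlapping partners,
# or return NA when it has none.
data$match <- vapply(seq_len(nrow(data)), function(i) {
  partners <- data$genes[overlaps[i, ]]
  if (length(partners) == 0) {
    NA_character_
  } else {
    paste(c(data$genes[i], partners), collapse = "-")
  }
}, character(1))

data
```

This sketch always puts the gene's own name first and lists partners in row order, so the labels may be ordered differently from the hand-written example above (for instance, row D comes out as D-A). The pairwise matrix grows quadratically with the number of rows, so for a really large data frame a dedicated interval-overlap tool such as data.table::foverlaps or GenomicRanges::findOverlaps is probably a better fit.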