In R I can do this as follows (albeit in a long-winded way).
Here is a test data.frame:
df <- data.frame(
  "CHR" = c(1, 1, 1, 2, 2),
  "START" = c(100, 200, 300, 100, 400),
  "STOP" = c(150, 350, 400, 500, 450)
)
First I make a GRanges object:
library(GenomicRanges)  # also attaches IRanges, which provides IRanges()
gr <- GRanges(
  seqnames = df$CHR,
  ranges = IRanges(start = df$START, end = df$STOP)
)
Then I reduce the intervals, collapsing overlapping ranges into a new GRanges object:
reduced <- reduce(gr)
Now I append a new column to the original data.frame that records which rows belong to the same contiguous 'chunk':
df$locus <- subjectHits(findOverlaps(gr, reduced))
Output:
> df
  CHR START STOP locus
1   1   100  150     1
2   1   200  350     2
3   1   300  400     2
4   2   100  500     3
5   2   400  450     3
How do I do this in Python? I am aware of pybedtools, but to my knowledge, this would require me to save my data.frame to disk. Any help appreciated.
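
One possible direction, as a minimal sketch only: the same labelling can be expressed in plain pandas, entirely in memory, by sorting the intervals and starting a new locus whenever an interval begins after the largest STOP seen so far on that chromosome. The function name `assign_locus` is hypothetical, the column names are taken from the test data above, and the sketch assumes the DataFrame has a unique index. It is not a drop-in equivalent of GenomicRanges.

```python
import pandas as pd

# Same test data as the R data.frame above.
df = pd.DataFrame({
    "CHR": [1, 1, 1, 2, 2],
    "START": [100, 200, 300, 100, 400],
    "STOP": [150, 350, 400, 500, 450],
})


def assign_locus(df):
    """Label rows whose intervals overlap, directly or via a chain, on the same chromosome.

    Hypothetical helper; assumes columns CHR, START, STOP and a unique index.
    """
    # Sort so that overlapping intervals end up next to each other per chromosome.
    s = df.sort_values(["CHR", "START", "STOP"])
    # Largest STOP among the earlier rows of the same chromosome (NaN for the first row).
    prev_max_stop = s.groupby("CHR")["STOP"].transform(lambda x: x.cummax().shift())
    # A new locus begins at the first row of a chromosome, or where START lies past
    # everything seen so far on that chromosome.
    new_locus = prev_max_stop.isna() | (s["START"] > prev_max_stop)
    # The running count of locus starts gives a 1-based locus id.
    locus = new_locus.cumsum()
    # Align the ids back onto the original row order via the index.
    return df.assign(locus=locus.reindex(df.index))


result = assign_locus(df)
print(result["locus"].tolist())  # [1, 2, 2, 3, 3]
```

One difference to be aware of: this sketch merges only intervals that truly overlap, whereas, as far as I understand, `reduce()` with its default `min.gapwidth` also merges ranges that are directly adjacent. Changing the comparison to `s["START"] > prev_max_stop + 1` should mimic that behaviour, but I have not checked it against GenomicRanges.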