I have a pandas DataFrame that looks like this:
   chrom  start  end  probability   read
0   chr1      1   10         0.99  read1
1   chr1      5   25         0.99  read2
2   chr1     15   25         0.99  read2
3   chr1     30   40         0.75  read4
What I want to do is merge the intervals that are on the same chromosome (chrom column) and whose coordinates (start, end) overlap. In some situations, where multiple intervals overlap each other, there will be intervals that should be merged even though they do not overlap directly. See rows 0 and 2 in the example above, which are linked through row 1, and the output of the merge below.
For those elements that are merged, I want to sum their probabilities (probability column) and count the unique elements in the 'read' column.
This would lead to the following output for the example above (note that rows 0, 1 and 2 have been merged):
   chrom  start  end  probability  read
0   chr1      1   25         2.97     2
1   chr1     30   40         0.75     1
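To pin down the merge semantics described above, here is a minimal pure-Python sketch. It is only meant as a slow reference for the expected result, not as a solution. The function name `merge_reference` is made up for illustration, and it assumes that an interval joins the current cluster when its start is less than or equal to the cluster's end (so intervals that merely touch are also merged).

```python
import pandas as pd

# Example data copied from the table in the question.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr1"],
    "start": [1, 5, 15, 30],
    "end": [10, 25, 25, 40],
    "probability": [0.99, 0.99, 0.99, 0.75],
    "read": ["read1", "read2", "read2", "read4"],
})


def merge_reference(frame):
    """Slow reference: merge overlapping intervals per chromosome."""
    clusters = []
    # Sorting guarantees that overlapping intervals are visited consecutively.
    ordered = frame.sort_values(["chrom", "start", "end"])
    for row in ordered.itertuples(index=False):
        current = clusters[-1] if clusters else None
        # Assumption: "overlap" means start <= end of the running cluster.
        if (current is not None and current["chrom"] == row.chrom
                and row.start <= current["end"]):
            current["end"] = max(current["end"], row.end)
            current["probability"] += row.probability
            current["reads"].add(row.read)
        else:
            clusters.append({
                "chrom": row.chrom, "start": row.start, "end": row.end,
                "probability": row.probability, "reads": {row.read},
            })
    return pd.DataFrame({
        "chrom": [c["chrom"] for c in clusters],
        "start": [c["start"] for c in clusters],
        "end": [c["end"] for c in clusters],
        "probability": [c["probability"] for c in clusters],
        "read": [len(c["reads"]) for c in clusters],  # unique reads per cluster
    })


result = merge_reference(df)
print(result)
```

Because each row is compared against the running cluster rather than only against the previous row, chains of intervals (such as rows 0, 1 and 2) end up in one merged row.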
Up to now, I have been doing this with pybedtools merge, but it has turned out to be too slow when done millions of times (my case). Hence, I am looking for other options, and pandas is the obvious one. I know that with pandas groupby one can apply different operations to the columns of the grouped rows, such as nunique and sum, which are the ones I need. Nevertheless, pandas groupby only groups rows that have exactly the same 'chrom', 'start' and 'end' values.
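To make the limitation concrete, here is a small sketch on the example data. The variable names are only for illustration. It groups on the exact coordinates and applies sum and nunique, and since no two rows share the same (chrom, start, end), nothing gets merged.

```python
import pandas as pd

# Example data copied from the table in the question.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr1"],
    "start": [1, 5, 15, 30],
    "end": [10, 25, 25, 40],
    "probability": [0.99, 0.99, 0.99, 0.75],
    "read": ["read1", "read2", "read2", "read4"],
})

# Group on exact coordinates; sum the probabilities and count unique reads.
exact = (
    df.groupby(["chrom", "start", "end"], as_index=False)
      .agg(probability=("probability", "sum"), read=("read", "nunique"))
)

# Every (chrom, start, end) triple is distinct, so each row is its own group.
print(len(exact))
```

What is missing is a grouping key that is shared by all rows belonging to the same cluster of overlapping intervals.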
My problem is that I don't know how to use pandas to merge my rows based on the coordinates (chrom,start,end) and then apply the sum and nunique operations.
Is there a fast way of doing this?
PS: As I said above, I am doing this millions of times, so speed is a big issue. Hence, I cannot use pybedtools or pure Python, which are too slow for my purposes.
Thanks!