2

In R (albeit longwinded):

Here is a test data.frame

df <- data.frame(
  "CHR" = c(1,1,1,2,2),
  "START" = c(100, 200, 300, 100, 400),
  "STOP" = c(150,350,400,500,450)
  )

First I make GRanges object:

gr <- GenomicRanges::GRanges(
  seqnames = df$CHR,
  ranges = IRanges(start = df$START, end = df$STOP)
  )

Then I reduce the intervals to collapse into new granges object:

reduced <- reduce(gr)

Now append a new column to original dataframe which confirms which rows belong to the same contiguous 'chunk'.

subjectHits(findOverlaps(gr, reduced))

Output:

> df
  CHR START STOP locus
1   1   100  150     1
2   1   200  350     2
3   1   300  400     2
4   2   100  500     3
5   2   400  450     3

How do I do this in Python? I am aware of pybedtools, but to my knowledge, this would require me to save my data.frame to disk. Any help appreciated.

The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156
mbyvcm
  • 141
  • 2
  • 12

2 Answers2

2

https://github.com/biocore-ntnu/pyranges

import pyranges as pr
chromosomes = [1] * 3 + [2] * 2
starts = [100, 200, 300, 100, 400]
ends = [150, 350, 400, 500, 450]
gr = pr.PyRanges(chromosomes=chromosomes, starts=starts, ends=ends)
gr.cluster()
# +--------------+-----------+-----------+-----------+
# |   Chromosome |     Start |       End |   Cluster |
# |       (int8) |   (int32) |   (int32) |   (int64) |
# |--------------+-----------+-----------+-----------|
# |            1 |       100 |       150 |         1 |
# |            1 |       200 |       350 |         2 |
# |            1 |       300 |       400 |         2 |
# |            2 |       100 |       500 |         3 |
# |            2 |       400 |       450 |         3 |
# +--------------+-----------+-----------+-----------+

It will be out in 0.0.21. Thanks for the idea!

The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156
  • 2
    If you want to produce the actual regions you can do the following: `gr.cluster().df.groupby(['Cluster']).agg({'Chromosome':'first', 'Start':'min', 'End':'max'})[['Chromosome','Start','End']]` – Robert Aug 31 '20 at 14:14
  • `gr.merge()` performs identical behavior. See https://pyranges.readthedocs.io/en/master/autoapi/pyranges/index.html#pyranges.PyRanges.merge – Muhammed Hasan Celik Apr 21 '21 at 01:14
  • Merge is similar, but not equivalent: it would only result in three lines. – The Unfun Cat Apr 21 '21 at 10:37
1

It appears you are trying to get the intersection of these. Pybedtools will accept streams as an input. Read your data into a string that is in bed format.

"chr,start,stop"

I start with a python dictionary and loop through it.

bed_string += "{0} {1} {2} {3} {0}|{1}|{2}|{3}\n".format(chrom, coord_start, coord_stop, aberration)
# Now create your bedtools.
breakpoint_bedtool = pybedtools.BedTool(bed_string, from_string=True)
target_bedtool = pybedtools.BedTool(self.args.Target_Bed_File, from_string=False)
# Find target intersects for printing.
breakpoint_target_intersect = breakpoint_bedtool.intersect(target_bedtool, wb=True, stream=True)
Dennis Simpson
  • 85
  • 1
  • 10