
I'm trying to use pyranges for an efficient implementation, but it seems limited and inflexible compared to R's GenomicRanges.

Say I have two PyRanges tables, pr1 and pr2. I want the indices of the rows in pr2 that overlap each row of pr1 and, where there is no overlap, a null instead. Is that possible?

For example, pr1 has about 50 rows, each giving a chromosome with start and end positions. pr2 has thousands of rows with chromosome, start, and end, plus probe coverages. How can I get the overlap between the two? I basically want a vector of length 50, one entry per row of pr1, holding the indices of the rows in pr2 that overlap that row. If a row in pr1 has no overlap, its entry should be NULL, just as in R. Can I do this with pyranges?

mgmussi
  • Ranges have `union` and `intersect` methods. Don't those do what you want? – Tim Roberts Aug 31 '22 at 18:40
  • @TimRoberts I don't see any union method; all they have is intersect and overlap. For example, pr1.overlap(pr2) just returns the rows in pr1 that are overlapped by pr2; it does not say which rows in pr2 cover the rows in pr1, so it's pretty useless. Is there a way to get the indices of the rows in pr2 that cover the rows in pr1, and NULL for rows with none? – Moses Stamboulian Aug 31 '22 at 18:46
  • pr1.join(pr2, how="left") – The Unfun Cat Sep 01 '22 at 04:07

1 Answer


As one of the comments pointed out, you can use the PyRanges join method. Let's make up some data:

import numpy as np, pyranges as pr, pandas as pd
f1 = pr.from_dict({'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [3, 8, 5],
                   'End': [6, 9, 7], 'Name': ['interval1', 'interval3', 'interval2']})
f2 = pr.from_dict({'Chromosome': ['chr1', 'chr1'], 'Start': [1, 6],
                   'End': [2, 7], 'Name': ['a', 'b']})
print(f1)
+--------------+-----------+-----------+------------+
| Chromosome   |     Start |       End | Name       |
| (category)   |   (int32) |   (int32) | (object)   |
|--------------+-----------+-----------+------------|
| chr1         |         3 |         6 | interval1  |
| chr1         |         8 |         9 | interval3  |
| chr1         |         5 |         7 | interval2  |
+--------------+-----------+-----------+------------+
Unstranded PyRanges object has 3 rows and 4 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

print(f2)
+--------------+-----------+-----------+------------+
| Chromosome   |     Start |       End | Name       |
| (category)   |   (int32) |   (int32) | (object)   |
|--------------+-----------+-----------+------------|
| chr1         |         1 |         2 | a          |
| chr1         |         6 |         7 | b          |
+--------------+-----------+-----------+------------+
Unstranded PyRanges object has 2 rows and 4 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

The join method will return a table whose rows correspond to overlapping ranges in f1 and f2. In the returned object, the column names of f1 are preserved, while those of f2 that are also present in f1 have a suffix added (by default, "_b"):

f1.join(f2)
+--------------+-----------+-----------+------------+-----------+-----------+------------+
| Chromosome   |     Start |       End | Name       |   Start_b |     End_b | Name_b     |
| (category)   |   (int32) |   (int32) | (object)   |   (int32) |   (int32) | (object)   |
|--------------+-----------+-----------+------------+-----------+-----------+------------|
| chr1         |         5 |         7 | interval2  |         6 |         7 | b          |
+--------------+-----------+-----------+------------+-----------+-----------+------------+
Unstranded PyRanges object has 1 rows and 7 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

For the overlapping intervals, all the information that was present in f1 and f2 is now available in the object returned by join.

Note: by default, join will only return the overlapping intervals. If you also want rows in f1 with no overlap in f2, or those in f2 with no overlap in f1, you can use how='left' or how='right' respectively:

f1.join(f2, how='left')

+--------------+-----------+-----------+------------+-----------+-----------+------------+
| Chromosome   |     Start |       End | Name       |   Start_b |     End_b | Name_b     |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) |   (int64) | (object)   |
|--------------+-----------+-----------+------------+-----------+-----------+------------|
| chr1         |         5 |         7 | interval2  |         6 |         7 | b          |
| chr1         |         3 |         6 | interval1  |        -1 |        -1 | -1         |
| chr1         |         8 |         9 | interval3  |        -1 |        -1 | -1         |
+--------------+-----------+-----------+------------+-----------+-----------+------------+
Unstranded PyRanges object has 3 rows and 7 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
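
In the left join above, rows of f1 without an overlap have their f2 columns filled with -1. If you want a null there instead, here is a minimal pure-Python sketch of the idea. The rows are copied by hand from the printed table so that it runs without pyranges; the tuple layout and the partner_or_none helper are made up for illustration and are not part of the library.

```python
# Rows copied by hand from the f1.join(f2, how='left') output above,
# as (Name, Start_b, End_b, Name_b). This layout is for illustration only.
joined_rows = [
    ("interval2", 6, 7, "b"),
    ("interval1", -1, -1, "-1"),
    ("interval3", -1, -1, "-1"),
]

def partner_or_none(row):
    """Return the f2 name for a joined row, or None if the row is unmatched."""
    name, start_b, end_b, name_b = row
    # Assumption: real coordinates are never negative, so Start_b == -1
    # can only be the "no overlap" filler.
    return None if start_b == -1 else name_b

# Map each f1 interval name to its f2 partner (or None).
partners = {row[0]: partner_or_none(row) for row in joined_rows}
print(partners)  # {'interval2': 'b', 'interval1': None, 'interval3': None}
```

The check is done on Start_b rather than Name_b because the filler in a text column is the string "-1", which could in principle be a legitimate name.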

Now, you say you want the indices of the overlapping rows. The thing is, because of its underlying implementation (a dictionary of dataframes, one per chromosome), AFAIK you shouldn't rely on row indices for any task in PyRanges. You can't use them to select particular rows, for example. For all practical purposes, the indices of the dataframes underlying a PyRanges object are inaccessible and unmodifiable.

If you really need to, you can simulate indices by adding numeric columns and using them to subset rows:

# add a running row number to each table
f1.index1 = np.arange(len(f1))
f2.index2 = np.arange(len(f2))
jf = f1.join(f2)

# subset f1 (still a PyRanges) to the rows that have an overlap in f2
f1[f1.index1.isin(jf.index1.unique())]

+--------------+-----------+-----------+------------+-----------+
| Chromosome   |     Start |       End | Name       |    index1 |
| (category)   |   (int32) |   (int32) | (object)   |   (int64) |
|--------------+-----------+-----------+------------+-----------|
| chr1         |         5 |         7 | interval2  |         2 |
+--------------+-----------+-----------+------------+-----------+
Unstranded PyRanges object has 1 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
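
To make the target of the question concrete (one entry per row of the first table, holding the positions of the overlapping rows in the second table, or a null), here is a brute-force reference in plain Python on the same toy intervals. It does not use pyranges at all. The overlap_indices function is a hypothetical helper, and it assumes half-open intervals, which is consistent with the join output above, where interval1 (3 to 6) does not overlap b (6 to 7).

```python
# Same toy intervals as f1 and f2, as plain (chromosome, start, end) tuples.
f1_rows = [("chr1", 3, 6), ("chr1", 8, 9), ("chr1", 5, 7)]
f2_rows = [("chr1", 1, 2), ("chr1", 6, 7)]

def overlap_indices(query, subject):
    """For each query row, list the positions of overlapping subject rows, or None."""
    result = []
    for q_chrom, q_start, q_end in query:
        hits = [
            j
            for j, (s_chrom, s_start, s_end) in enumerate(subject)
            # Assumption: half-open intervals, so touching ends do not overlap.
            if q_chrom == s_chrom and q_start < s_end and s_start < q_end
        ]
        result.append(hits if hits else None)
    return result

hits_per_row = overlap_indices(f1_rows, f2_rows)
print(hits_per_row)  # [None, None, [1]]
```

This compares every pair of rows, so it is only meant for checking small cases by hand. On real data you would build the same per-row lists from the index1 and index2 columns of the joined table.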