
I am trying to find pairs of intervals that overlap by at least a user-specified minimum overlap length. The intervals come from this pandas DataFrame:

import pandas as pds

print(df1.head())
print(df1.tail())
   query_id  start_pos  end_pos  read_length orientation
0   1687655          1     4158         4158           F
1   2485364          1     7233         7233           R
2   1412202          1     3215         3215           R
3   1765889          1     3010         3010           R
4   2944965          1     4199         4199           R
         query_id  start_pos   end_pos  read_length orientation
3082467    112838   27863832  27865583         1752           F
3082468    138670   28431208  28431804          597           R
3082469    171683   28489928  28490409          482           F
3082470   2930053   28569533  28569860          328           F
3082471   1896622   28589281  28589554          274           R

where start_pos and end_pos are the coordinates at which an interval starts and ends, and read_length is the length of the interval.

The data is sorted by start_pos.

The program should have the following output format:

query_id1 -- query_id2 -- read_length1 -- read_length2 -- overlap_length

I am running the program on a compute node with up to 512 GB of RAM and 4x Intel Xeon E7-4830 CPUs (32 cores).

I've tried running my own code to find the overlaps, but it is taking too long to run.

Here is the code that I tried:

import pandas as pds
import numpy as np

overlap_df = pds.DataFrame()

def create_overlap_table(df1, ovl_len):

    # ... (sort and clean the data here) ...

    def iterate_queries(row):
        global overlap_df
        # locate the current row's position and the row after it
        index1 = df1.index[df1['query_id'] == row['query_id']]
        next_int_index = df1.index.get_loc(index1[0]) + 1

        if row['read_length'] >= ovl_len:
            if df1.index.size - 1 >= next_int_index:
                end_pos_minus_ovlp = (row['end_pos'] - ovl_len) + 2

                # candidate partners: intervals that start early enough
                # to overlap the current one by at least ovl_len
                subset_df = df1.loc[df1['start_pos'] < end_pos_minus_ovlp]
                subset_df = subset_df.loc[subset_df.index == subset_df.index.max()]
                subset_df = df1.iloc[next_int_index:df1.index.get_loc(subset_df.index[0])]
                subset_df = subset_df.loc[subset_df['read_length'] >= ovl_len]

                rows1 = pds.DataFrame({'read_id1': np.repeat(row['query_id'], repeats=subset_df.index.size),
                                       'read_id2': subset_df['query_id']})
                # DataFrame.append was removed in pandas 2.0; concat is equivalent here
                overlap_df = pds.concat([overlap_df, rows1])

    df1.apply(iterate_queries, axis=1)

    print(overlap_df)

Again, I ran this code on the compute node, but it was running for hours before I finally cancelled the job.

I've also tried using two packages for this problem, PyRanges and an R package called IRanges, but they take too long to run as well. I've seen posts on interval trees and a Python library called pybedtools, and I was planning to look into them as a next step.

Any feedback would be really appreciated.


EDIT: For a minimum overlap length of, say, 800, the first 5 rows of the output should look like this:

query_id1 query_id2 read_length1 read_length2 overlap_length
1687655    2485364       4158         7233          4158
1687655    1412202       4158         3215          3215
1687655    1765889       4158         3010          3010             
1687655    2944965       4158         4199          4158
2485364    1412202       7233         3215          3215

So, query_id1 and query_id2 cannot be identical. Also, no duplications (i.e., an overlap between A and B should not appear twice in the output).
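For clarity, since the coordinates above are inclusive (read_length equals end_pos - start_pos + 1), the overlap between two intervals follows a simple formula. A minimal sketch (the helper name is mine, for illustration only):

```python
# Overlap of two inclusive intervals [s1, e1] and [s2, e2];
# a result <= 0 means the intervals do not overlap at all.
def overlap_length(s1, e1, s2, e2):
    return min(e1, e2) - max(s1, s2) + 1

# first two rows of df1: both start at 1, ending at 4158 and 7233
print(overlap_length(1, 4158, 1, 7233))  # 4158, as in the expected output
```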

  • What should be the desired output for the first five rows? – U13-Forward Jul 15 '19 at 02:16
  • I am guessing your problem is that there is a lot of overlap in your file, so you basically have to enumerate the square of millions of intervals, and any algorithm would be slow. I am going to think about how to find the interval with property X (for example, longest) in pyranges, though. It is likely to pop up often enough. – The Unfun Cat Aug 20 '19 at 06:12
  • @TheUnfunCat Yes, you're right. My intervals are very long (median length: 3,275 units; mean length: 4,012 units) and, hence, there are many overlaps between them. I had a total of 3,082,472 intervals. I was eventually able to solve this problem by Cythonizing my Python code for an algorithm similar to that described by n.m. – Michael Aug 21 '19 at 23:09

2 Answers


Here's an algorithm.

  1. Prepare a set of intervals sorted by starting point. Initially the set is empty.
  2. Sort all starting and ending points.
  3. Traverse the points. If a starting point is encountered, add the corresponding interval to the set. If an ending point is encountered, remove the corresponding interval from the set.
  4. When removing an interval, look at other intervals in the set. They all overlap the interval being removed, and they are sorted by the length of the overlap, longest first. Traverse the set until the length is too short, and report each overlap.
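The four steps above can be sketched in Python with only the standard library (the interval tuples and minimum overlap are illustrative, not the asker's data, and the sketch uses half-open overlap arithmetic for simplicity):

```python
import bisect

def find_overlaps(intervals, min_ovl):
    """intervals: iterable of (start, end, id); returns (id1, id2, overlap).
    Each overlapping pair is reported exactly once, with no self-pairs."""
    events = []
    for s, e, i in intervals:
        events.append((s, 0, i))  # start event
        events.append((e, 1, i))  # end event (sorts after a start at the same point)
    events.sort()                 # step 2: sort all starting and ending points

    active = []   # step 1: (start, id) pairs, kept sorted by starting point
    starts = {}   # id -> start, to locate an entry when its end is reached
    out = []
    for pos, kind, i in events:   # step 3: traverse the points
        if kind == 0:
            bisect.insort(active, (pos, i))
            starts[i] = pos
        else:
            s = starts.pop(i)
            del active[bisect.bisect_left(active, (s, i))]
            # step 4: every remaining active interval started before pos and
            # ends at or after it, so its overlap with the removed interval
            # is pos - max(s, other_start), shrinking as the start grows
            for other_start, other_id in active:
                ovl = pos - max(s, other_start)
                if ovl < min_ovl:
                    break         # all later starts overlap even less
                out.append((i, other_id, ovl))
    return out

print(find_overlaps([(1, 10, 'A'), (5, 20, 'B'), (15, 30, 'C')], 3))
# [('A', 'B', 5), ('B', 'C', 5)]
```

Because the active set is ordered by starting point, the traversal in step 4 can stop at the first overlap shorter than the minimum, which is what keeps the reporting cost proportional to the number of qualifying pairs.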

pyranges author here. Thanks for trying my library.

How big are your data? When both PyRanges had 1e7 rows, the heavy part done by pyranges took about 12 seconds using 24 cores on a busy server with 200 GB of RAM.

Setup:

import pyranges as pr
import numpy as np
import mkl

mkl.set_num_threads(1)

### Create data ###
length = int(1e7)
minimum_overlap = 5

gr = pr.random(length)
gr2 = pr.random(length)

# add ids
gr.id1 = np.arange(len(gr))
gr2.id2 = np.arange(len(gr2))

# add lengths
gr.length1 = gr.lengths()
gr2.length2 = gr2.lengths()
gr

# +--------------+-----------+-----------+--------------+-----------+-----------+
# | Chromosome   | Start     | End       | Strand       | id1       | length1   |
# | (category)   | (int32)   | (int32)   | (category)   | (int64)   | (int32)   |
# |--------------+-----------+-----------+--------------+-----------+-----------|
# | chr1         | 146230338 | 146230438 | +            | 0         | 100       |
# | chr1         | 199561432 | 199561532 | +            | 1         | 100       |
# | chr1         | 189095813 | 189095913 | +            | 2         | 100       |
# | chr1         | 27608425  | 27608525  | +            | 3         | 100       |
# | ...          | ...       | ...       | ...          | ...       | ...       |
# | chrY         | 21533766  | 21533866  | -            | 9999996   | 100       |
# | chrY         | 30105890  | 30105990  | -            | 9999997   | 100       |
# | chrY         | 49764407  | 49764507  | -            | 9999998   | 100       |
# | chrY         | 3319478   | 3319578   | -            | 9999999   | 100       |
# +--------------+-----------+-----------+--------------+-----------+-----------+
# Stranded PyRanges object has 10,000,000 rows and 6 columns from 25 chromosomes.

Doing your analysis:

j = gr.join(gr2, new_pos="intersection", nb_cpu=24)
# CPU times: user 3.85 s, sys: 3.56 s, total: 7.41 s
# Wall time: 12.3 s
j.overlap = j.lengths()
out = j.df["id1 id2 length1 length2 overlap".split()]

out = out[out.overlap >= minimum_overlap]
out.head()

       id1     id2  length1  length2  overlap
1    2  485629      100      100       74
2    2  418820      100      100       92
3    3  487066      100      100       13
4    7  191109      100      100       31
5   11  403447      100      100       76
  • Also, bedtools/pybedtools are wonderful libraries with many great features, but they are not in-memory, so slower for most things I tested :) – The Unfun Cat Aug 19 '19 at 08:24