2

I am using data.table::foverlaps in the context of overlap genomics interval problems. I recently got into trying to find the equivalent of foverlaps in Python because it would be much nicer to use only one langage instead of combining Python & R everytime I have to dig in analysis outputs. Of course I am not the first one to ask the question of finding an equivalent to R foverlaps in python appliable in Python pandas. These are the most relevant post I found on SO :

2015 Merge pandas dataframes where one value is between two others

2016 R foverlaps equivalent in Python

2017 How to join two dataframes for which column values are within a certain range?

2018 How to reproduce the same output of foverlaps in R with merge of pandas in python?

The thing is I am not a specialist of Python at all. So I took the answer the most relevant/understandable looking like to me, the sqlite3 one.

This is how I am doing it in R :

library(data.table)

interv1 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("+",10))
interv2 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("-",10))
interv  <- rbind(interv1, interv2)
interv <- data.table(interv)
colnames(interv) <- c('start', 'stop', 'color', 'strand')
interv$start <- as.integer(interv$start)
interv$stop <- as.integer(interv$stop)
interv$stop <- interv$stop -1
interv$cov <- runif(n=nrow(interv), min = 10, max = 200)

to_match <- data.table(cbind(rep(seq(from = 4, to = 43, by = 4),2), rep(c(rep("blue", 5), rep("red", 5)), 2), c(rep("-", 10), rep("+", 10))))
colnames(to_match) <- c('start', 'color', 'strand')
to_match$stop <-  to_match$start 
to_match$start <- as.integer(to_match$start)
to_match$stop <- as.integer(to_match$stop)

setkey(interv, color, strand, start, stop)
setkey(to_match, color, strand, start, stop)

overlapping_df <- foverlaps(to_match,interv)

#write.csv(x = interv, file = "Documents/script/SO/wig_foverlaps_test.txt", row.names = F)
#write.csv(x = to_match, file = "Documents/script/SO/cov_foverlaps_test.txt", row.names = F)

And this is how I tried to reproduce it in python :

import pandas as pd
import sqlite3

cov_table = pd.DataFrame(pd.read_csv('SO/cov_foverlaps_test.txt', skiprows = [0], header=None))
cov_table.columns = ['start', 'stop', 'chrm', 'strand', 'cov']
cov_table.stop = cov_table.stop - 1


wig_file = pd.DataFrame(pd.read_csv('SO/wig_foverlaps_test.txt', header=None, skiprows = [0]))
wig_file.columns = ['i_start', 'chrm', 'i_strand', 'i_stop']

cov_cols = ['start','stop','chrm','strand','cov']
fract_cols = ['i_start','i_stop','chrm','i_strand']

cov_table = cov_table.reindex(columns = cov_cols)
wig_file = wig_file.reindex(columns = fract_cols)

cov_table.start = pd.to_numeric(cov_table['start'])
cov_table.stop = pd.to_numeric(cov_table['stop'])

wig_file.i_start = pd.to_numeric(wig_file['i_start'])
wig_file.i_stop = pd.to_numeric(wig_file['i_stop'])



conn = sqlite3.connect(':memory:')

cov_table.to_sql('cov_table', conn, index=False)
wig_file.to_sql('wig', conn, index=False)

qry = '''
    select  
        start PresTermStart,
        stop PresTermEnd,
        cov RightCov,
        i_start pos,
        strand Strand
    from
        cov_table join wig on
        i_start between start and stop and 
        cov_table.strand = wig.i_strand
     '''

test = pd.read_sql_query(qry, conn)

No matter what I change the code, I always find some small differencies in the output (test), in this example I'm missing two lines in the python resulting table, where the value that is supposed to fall in the range and is equal to the end of the range :

missing line :

> 19   24  141.306318     24      +
> 
> 19   24  122.923700     24      -

Finally, I fear that if I find the right way to do it with sqlite3, the computing time difference with data.table::foverlaps will be too huge.

To conclude :

  • My first question is ofc where did I go wrong in my code ?
  • Is there an approach that would be more fitted and close to foverlaps in terms of computing speed?

Thanks you for reading, I hope this post is fitted for SO.

Paul Endymion
  • 537
  • 3
  • 18

2 Answers2

2

Essentially, the merged and interval logic differs between R and Python.

R

According to foverlaps docs, you are using the default, any type which runs below condition:

Let [a,b] and [c,d] be intervals in x and y with a<=b and c<=d.
...
For type="any", as long as c<=b and d>=a, they overlap.

In addition, you join on the other columns of keys. Altogether, you are imposing the following logic (translated to SQLite columns for comparison):

foverlaps(to_match, interv) --> foverlaps(cov_table, wig)

  1. wig.i_start <= cov_table.stop (i.e., c <= b)
  2. wig.i_stop >= cov_table.start (i.e., d >= a)
  3. wig.color == cov_table.color
  4. wig.strand == cov_table.strand

Python

You are running an INNER JOIN + interval query imposing the following logic:

  1. wig.i_start >= cov_table.start (i.e., i_start between start and stop)
  2. wig.i_start <= cov_table.stop (i.e., i_start between start and stop)
  3. wig.strand == cov_table.strand

Noticeable Python differences in comparison to R: wig.i_stop is never used; wig.i_chrm (or color) is never used; and wig.i_start is conditioned twice.

To resolve, consider the following untested SQL adjustment to hopefully achieve the R results. By the way it is best practice in SQL to alias ALL columns in JOIN clauses (even SELECT):

select  
   cov_table.start as PresTermStart,
   cov_table.stop as PresTermEnd,
   cov_table.cov as RightCov,
   wig.i_start as pos,
   wig.strand as Strand
from
   cov_table 
join wig 
    on cov_table.color = wig.i_chrm
   and cov_table.strand = wig.i_strand
   and wig.i_start <= cov_table.stop 
   and wig.i_stop  >= cov_table.start 

For better performance, consider using a persistent (not in-memory) SQLite database and create indexes on join fields: color, strand, start and stop.

Parfait
  • 104,375
  • 17
  • 94
  • 125
  • NIce! It did the trick with a few modifications. It makes things clearer. I feel I should dig more these merging problems. – Paul Endymion May 23 '19 at 17:32
  • Great to hear! Yes, I thought some adjustments would be required. Happy coding! – Parfait May 23 '19 at 17:45
  • SQL indexes are not similar to `setkey`, data.table do have indexes (`setindex`). `setkey` is a specific `clustered index`, not sure if sqlite has clustered index. – jangorecki May 24 '19 at 14:01
  • Thanks @jangorecki. Edited out similarity line. Do note: some RDBMS' do support cluster indexes but varies by vendor. OP uses SQLite as the vendor which its [`WITHOUT ROWID`](https://www.sqlite.org/withoutrowid.html) uses a cluster index as primary key. – Parfait May 24 '19 at 14:12
1

To do interval overlaps in Python, simply use pyranges:

import pyranges as pr

c1 = """Chromosome Start End Gene
1 10 20 blo
1 45 46 bla"""

c2 = """Chromosome Start End Gene
1 10 35 bip
1 25 50 P53
1 40 10000 boop"""


gr1, gr2 = pr.from_string(c1), pr.from_string(c2)

j = gr1.join(gr2)
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# |   Chromosome |     Start |       End | Gene       |   Start_b |     End_b | Gene_b     |
# |   (category) |   (int32) |   (int32) | (object)   |   (int32) |   (int32) | (object)   |
# |--------------+-----------+-----------+------------+-----------+-----------+------------|
# |            1 |        10 |        20 | blo        |        10 |        35 | bip        |
# |            1 |        45 |        46 | bla        |        25 |        50 | P53        |
# |            1 |        45 |        46 | bla        |        40 |     10000 | boop       |
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# Unstranded PyRanges object has 3 rows and 7 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156
  • Thank you for your answer it looks great ! Did you write this library ? I see benchmark at the bottom of the page, by any chance did you do one with data.table's foverlaps ? – Paul Endymion Apr 24 '20 at 12:17
  • I did not, I did not even know of data.tables existence :) But pyranges easily handles large datasets efficiently. Yes, I wrote it :) – The Unfun Cat Apr 24 '20 at 13:45