I am using data.table::foverlaps
in the context of overlap genomics interval problems.
I recently got into trying to find the equivalent of foverlaps in Python because it would be much nicer to use only one langage instead of combining Python & R everytime I have to dig in analysis outputs.
Of course I am not the first one to ask the question of finding an equivalent to R foverlaps in python appliable in Python pandas. These are the most relevant post I found on SO :
2015 Merge pandas dataframes where one value is between two others
2016 R foverlaps equivalent in Python
2017 How to join two dataframes for which column values are within a certain range?
2018 How to reproduce the same output of foverlaps in R with merge of pandas in python?
The thing is I am not a specialist of Python at all. So I took the answer the most relevant/understandable looking like to me, the sqlite3
one.
This is how I am doing it in R :
library(data.table)
interv1 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("+",10))
interv2 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("-",10))
interv <- rbind(interv1, interv2)
interv <- data.table(interv)
colnames(interv) <- c('start', 'stop', 'color', 'strand')
interv$start <- as.integer(interv$start)
interv$stop <- as.integer(interv$stop)
interv$stop <- interv$stop -1
interv$cov <- runif(n=nrow(interv), min = 10, max = 200)
to_match <- data.table(cbind(rep(seq(from = 4, to = 43, by = 4),2), rep(c(rep("blue", 5), rep("red", 5)), 2), c(rep("-", 10), rep("+", 10))))
colnames(to_match) <- c('start', 'color', 'strand')
to_match$stop <- to_match$start
to_match$start <- as.integer(to_match$start)
to_match$stop <- as.integer(to_match$stop)
setkey(interv, color, strand, start, stop)
setkey(to_match, color, strand, start, stop)
overlapping_df <- foverlaps(to_match,interv)
#write.csv(x = interv, file = "Documents/script/SO/wig_foverlaps_test.txt", row.names = F)
#write.csv(x = to_match, file = "Documents/script/SO/cov_foverlaps_test.txt", row.names = F)
And this is how I tried to reproduce it in python :
import pandas as pd
import sqlite3
cov_table = pd.DataFrame(pd.read_csv('SO/cov_foverlaps_test.txt', skiprows = [0], header=None))
cov_table.columns = ['start', 'stop', 'chrm', 'strand', 'cov']
cov_table.stop = cov_table.stop - 1
wig_file = pd.DataFrame(pd.read_csv('SO/wig_foverlaps_test.txt', header=None, skiprows = [0]))
wig_file.columns = ['i_start', 'chrm', 'i_strand', 'i_stop']
cov_cols = ['start','stop','chrm','strand','cov']
fract_cols = ['i_start','i_stop','chrm','i_strand']
cov_table = cov_table.reindex(columns = cov_cols)
wig_file = wig_file.reindex(columns = fract_cols)
cov_table.start = pd.to_numeric(cov_table['start'])
cov_table.stop = pd.to_numeric(cov_table['stop'])
wig_file.i_start = pd.to_numeric(wig_file['i_start'])
wig_file.i_stop = pd.to_numeric(wig_file['i_stop'])
conn = sqlite3.connect(':memory:')
cov_table.to_sql('cov_table', conn, index=False)
wig_file.to_sql('wig', conn, index=False)
qry = '''
select
start PresTermStart,
stop PresTermEnd,
cov RightCov,
i_start pos,
strand Strand
from
cov_table join wig on
i_start between start and stop and
cov_table.strand = wig.i_strand
'''
test = pd.read_sql_query(qry, conn)
No matter what I change the code, I always find some small differencies in the output (test), in this example I'm missing two lines in the python resulting table, where the value that is supposed to fall in the range and is equal to the end of the range :
missing line :
> 19 24 141.306318 24 +
>
> 19 24 122.923700 24 -
Finally, I fear that if I find the right way to do it with sqlite3
, the computing time difference with data.table::foverlaps
will be too huge.
To conclude :
- My first question is ofc where did I go wrong in my code ?
- Is there an approach that would be more fitted and close to foverlaps in terms of computing speed?
Thanks you for reading, I hope this post is fitted for SO.