can anyone help me improve this pandas code?
import pandas as pd
df = pd.DataFrame(
[
[
'chr1', 222
],
[
'chr1', 233
],
[
'chr1', 2123
],
[
'chr2', 244
]
], columns = ['chrom', 'pos']
)
df2 = pd.DataFrame(
[
[
'chr1', 221, 223
],
[
'chr1', 230, 240
],
], columns = ['chrom', 'start', 'end']
)
Gives me 2 dfs with genomic coordinates. The first one is an exact position:
chrom pos
0 chr1 222
1 chr1 233
2 chr1 2123
3 chr2 244
and the second is a range:
chrom start end
0 chr1 221 223
1 chr1 230 240
I need to find the count of exact coordinates that are in one of the ranges (in the same chrom)
This works but is slow:
c=0
for chrom, data in df.groupby('chrom'):
tmp = df2.query(f'chrom == "{chrom}"')
for p in data.pos:
for s, e in zip(tmp.start, tmp.end):
if s < p < e:
c+=1
Then c = 2
I think I can use agg to do this without iteration (and hopefully faster) but I can't get it working. Can anyone show me how?
PS this is also asked on the bioinformatics stack beta.