Can I use pandas agg to do this without iteration?

Question

can anyone help me improve this pandas code?

import pandas as pd

df = pd.DataFrame(
    [
        [
            'chr1', 222
        ],
        [
            'chr1', 233
        ],
        [
            'chr1', 2123
        ],
        [
            'chr2', 244
        ]
    ], columns = ['chrom', 'pos']
)

df2 = pd.DataFrame(
    [
        [
            'chr1', 221, 223
        ],
        [
            'chr1', 230, 240
        ],
    ], columns = ['chrom', 'start', 'end']
)

Gives me 2 dfs with genomic coordinates. The first one is an exact position:

    chrom   pos
0   chr1    222
1   chr1    233
2   chr1    2123
3   chr2    244

and the second is a range:

    chrom   start   end
0   chr1    221 223
1   chr1    230 240

I need to find the count of exact coordinates that are in one of the ranges (in the same chrom)

This works but is slow:

c=0
for chrom, data in df.groupby('chrom'):
    tmp = df2.query(f'chrom == "{chrom}"')
    for p in data.pos:
        for s, e in zip(tmp.start, tmp.end):
            if s < p < e:
                c+=1

Then c = 2

I think I can use agg to do this without iteration (and hopefully faster) but I can't get it working. Can anyone show me how?

PS this is also asked on the bioinformatics stack beta.

score 1 · Answer 1 · answered Feb 12 '21 at 14:15

1

Expected output has not been provided. This will count how many times a chrom is in a range.

expand out range using RangeIndex() and explode()
left merge()
groupby() and count()

dfc = df.merge(df2.assign(rng=df2.apply(lambda r: pd.RangeIndex(r["start"], r["end"]+1), axis=1)).explode("rng"),
         left_on=["chrom","pos"], right_on=["chrom","rng"],
         how="left"
        ).groupby(["chrom","pos"], as_index=False)["rng"].count()

	chrom	pos	rng
0	chr1	222	1
1	chr1	233	1
2	chr1	2123	0
3	chr2	244	0

answered Feb 12 '21 at 14:15

Rob Raymond

29,118
3
14
30

Expected output is C= 2. Maybe I should have put that in the code section, sorry. I can see how to get there from your code though, thanks! – Liam McIntyre Feb 12 '21 at 19:43
so it's the sum of the **rng** column? – Rob Raymond Feb 12 '21 at 20:11
Based on that small output, yes. It's just the sum of all instances where a pos (genomic position) is between a given start and end, in the same chromosome. – Liam McIntyre Feb 12 '21 at 20:16
I think thats actually slower. And its very RAM hungry, I tried it on an actual dataset and put my laptop into swap. I'll have to try on the server. – Liam McIntyre Feb 12 '21 at 21:07
how many rows in the two data frames? I can see that expanding out the ranges could generate a large structure – Rob Raymond Feb 12 '21 at 21:28
1

https://stackoverflow.com/questions/44367672/best-way-to-join-merge-by-range-in-pandas is effectively same question you asked, where solution is tuned – Rob Raymond Feb 12 '21 at 21:34
there's 2-3k positions and ~100k ranges in the full dfs. – Liam McIntyre Feb 12 '21 at 22:25

Can I use pandas agg to do this without iteration?

1 Answers1