0

can anyone help me improve this pandas code?

import pandas as pd

df = pd.DataFrame(
    [
        [
            'chr1', 222
        ],
        [
            'chr1', 233
        ],
        [
            'chr1', 2123
        ],
        [
            'chr2', 244
        ]
    ], columns = ['chrom', 'pos']
)

df2 = pd.DataFrame(
    [
        [
            'chr1', 221, 223
        ],
        [
            'chr1', 230, 240
        ],
    ], columns = ['chrom', 'start', 'end']
)

Gives me 2 dfs with genomic coordinates. The first one is an exact position:

    chrom   pos
0   chr1    222
1   chr1    233
2   chr1    2123
3   chr2    244

and the second is a range:

    chrom   start   end
0   chr1    221 223
1   chr1    230 240

I need to find the count of exact coordinates that are in one of the ranges (in the same chrom)

This works but is slow:

c=0
for chrom, data in df.groupby('chrom'):
    tmp = df2.query(f'chrom == "{chrom}"')
    for p in data.pos:
        for s, e in zip(tmp.start, tmp.end):
            if s < p < e:
                c+=1

Then c = 2

I think I can use agg to do this without iteration (and hopefully faster) but I can't get it working. Can anyone show me how?

PS this is also asked on the bioinformatics stack beta.

Liam McIntyre
  • 334
  • 1
  • 13

1 Answers1

1

Expected output has not been provided. This will count how many times a chrom is in a range.

  • expand out range using RangeIndex() and explode()
  • left merge()
  • groupby() and count()
dfc = df.merge(df2.assign(rng=df2.apply(lambda r: pd.RangeIndex(r["start"], r["end"]+1), axis=1)).explode("rng"),
         left_on=["chrom","pos"], right_on=["chrom","rng"],
         how="left"
        ).groupby(["chrom","pos"], as_index=False)["rng"].count()

chrom pos rng
0 chr1 222 1
1 chr1 233 1
2 chr1 2123 0
3 chr2 244 0
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30