0

I wrote a python script with a nested for loop which adds elements of first pandas df in a new column of the second pandas df with a condition that element of a column in first pandas df is in between ANY elements of two columns of the second pandas df. My python script is running fine but it is taking a lot of time to complete. It will be great if someone help me to improve the script so that the run time is reduced as my dataset is very large. Looking forward for the answers.

Here is the sample of the first pandas df (GT) and second pandas df (GC) -

import pandas as pd
GT = pd.DataFrame({'CHROM': ['chr1', 'chr1', 'chr1','chr1'], 
                 'POS': [23197, 23308, 634553, 727233]
                 'HET': [0,2,3,2]})
GC=pd.DataFrame({'Gene_ID': ['ENSG00000227232', 'ENSG00000269981', 'ENSG00000279457','ENSG00000225972'], 
                 'Gene_Name': ['WASH7P', 'ENSG00000269981', 'WASH9P', 'MTND1P23']
                 'start': [14404,137682,185217,629062]
                 'end':[29570,137965,195411,629433]})

and here is my Python code that I ran on these pandas df

import pandas as pd
with open("merged_file.txt","w") as fp:
        Gene_ID=[]
        Gene_Name=[]
        Gene_Coordinate_start=[]
        Gene_Coordinate_end=[]
        for m in range(0,len(GT["POS"])-1):
                for n in range(0,len(GC["start"])-1):
                        if GT["POS"].iloc[m] >= GC["start"].iloc[n] and GT["POS"].iloc[m] <= GC["end"].iloc[n]:
                                Gene_ID.append(GC["Gene_ID"].iloc[n])
                                Gene_Name.append(GC["Gene_Name"].iloc[n])
                                Gene_Coordinate_start.append(GC["start"].iloc[n])
                                Gene_Coordinate_end.append(GC["end"].iloc[n])
                                print(m)
                                print(n)
        GT["Gene_ID"]=Gene_ID
        GT["Gene_Name"]=Gene_Name
        GT["Gene_Coordinate_start"]=Gene_Coordinate_start
        GT["Gene_Coordinate_end"]=Gene_Coordinate_end
        GT.to_csv('merge_snp_gene.csv')
  • 1
    Please have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [edit] your question to include a [mcve] in the body of the question as text, not as images. – G. Anderson Mar 31 '22 at 21:21
  • Thank you for your comments. I modified my question and included the minimal reproducible data sample and code in the body of the question as you suggested – Mrinal Mishra Mar 31 '22 at 22:49

0 Answers0