I wrote a python script with a nested for loop which adds elements of first pandas df in a new column of the second pandas df with a condition that element of a column in first pandas df is in between ANY elements of two columns of the second pandas df. My python script is running fine but it is taking a lot of time to complete. It will be great if someone help me to improve the script so that the run time is reduced as my dataset is very large. Looking forward for the answers.
Here is the sample of the first pandas df (GT) and second pandas df (GC) -
import pandas as pd
GT = pd.DataFrame({'CHROM': ['chr1', 'chr1', 'chr1','chr1'],
'POS': [23197, 23308, 634553, 727233]
'HET': [0,2,3,2]})
GC=pd.DataFrame({'Gene_ID': ['ENSG00000227232', 'ENSG00000269981', 'ENSG00000279457','ENSG00000225972'],
'Gene_Name': ['WASH7P', 'ENSG00000269981', 'WASH9P', 'MTND1P23']
'start': [14404,137682,185217,629062]
'end':[29570,137965,195411,629433]})
and here is my Python code that I ran on these pandas df
import pandas as pd
with open("merged_file.txt","w") as fp:
Gene_ID=[]
Gene_Name=[]
Gene_Coordinate_start=[]
Gene_Coordinate_end=[]
for m in range(0,len(GT["POS"])-1):
for n in range(0,len(GC["start"])-1):
if GT["POS"].iloc[m] >= GC["start"].iloc[n] and GT["POS"].iloc[m] <= GC["end"].iloc[n]:
Gene_ID.append(GC["Gene_ID"].iloc[n])
Gene_Name.append(GC["Gene_Name"].iloc[n])
Gene_Coordinate_start.append(GC["start"].iloc[n])
Gene_Coordinate_end.append(GC["end"].iloc[n])
print(m)
print(n)
GT["Gene_ID"]=Gene_ID
GT["Gene_Name"]=Gene_Name
GT["Gene_Coordinate_start"]=Gene_Coordinate_start
GT["Gene_Coordinate_end"]=Gene_Coordinate_end
GT.to_csv('merge_snp_gene.csv')