This is my first question on the forum. Thanks for any help!
I wrote a nested for loop based on df.iterrows() and it takes a huge amount of time to run. I need to assign a value from one dataframe to another by checking every pair of rows against the conditions in the function below. Can you help me make it efficient (multiprocessing, the apply method, vectorization, or anything else)? I would be so grateful! :)
Sample data:
import pandas as pd
import numpy as np
d1 = {'geno_start' : [60, 1120, 1660], 'geno_end' : [90, 1150, 1690], 'original_subseq' : ['AAATGCCTGAACCTTGGAATTGGA', 'AAATGCCTGAACCTTGGAATTGGA', 'AAATGCCTGAACCTTGGAATTGGA']}
d2 = {'most_left_coordinate_genome' : [56, 1120, 1655], 'most_right_coordinate_genome' : [88, 1150, 1690], 'protein_ID' : ['XYZ_1', 'XYZ_2', 'XYZ_3']}
df_1 = pd.DataFrame(data=d1)
df_2 = pd.DataFrame(data=d2)
# Use an object column so string IDs can be stored without a dtype change.
df_1['protein_ID'] = None
def match_ranges(df1: pd.DataFrame, df2: pd.DataFrame):
    # Compare every row of df2 with every row of df1 (len(df1) * len(df2) checks).
    for index, row_2 in df2.iterrows():
        for index_1, row_1 in df1.iterrows():
            # Case 1: the df1 range lies fully inside the df2 range.
            if (row_1['geno_start'] >= row_2['most_left_coordinate_genome']) and (row_1['geno_end'] <= row_2['most_right_coordinate_genome']):
                # .loc with row label and column avoids chained assignment.
                df1.loc[index_1, 'protein_ID'] = row_2['protein_ID']
            # Case 2: the start is within 30 of the left edge and the end is inside.
            elif (abs(row_1['geno_start'] - row_2['most_left_coordinate_genome']) < 30) and (row_1['geno_end'] <= row_2['most_right_coordinate_genome']):
                df1.loc[index_1, 'protein_ID'] = row_2['protein_ID']
            # Case 3: the start is inside and the end is within 30 of the right edge.
            elif (row_1['geno_start'] >= row_2['most_left_coordinate_genome']) and (abs(row_1['geno_end'] - row_2['most_right_coordinate_genome']) < 30):
                df1.loc[index_1, 'protein_ID'] = row_2['protein_ID']
match_ranges(df_1, df_2)
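For comparison, here is a minimal sketch of how the same three conditions might be vectorized with NumPy broadcasting instead of iterrows(). The function name match_ranges_vectorized and the tol parameter are hypothetical names introduced for illustration. It assumes df2 has at least one row and that, as in the nested loop, the last matching df2 row should win when several match. Unlike the loop, it returns a new dataframe and sets unmatched rows to None.

```python
import numpy as np
import pandas as pd


def match_ranges_vectorized(df1: pd.DataFrame, df2: pd.DataFrame, tol: int = 30) -> pd.DataFrame:
    # Hypothetical helper: column vectors for df1, row vectors for df2, so that
    # every comparison broadcasts to a (len(df1), len(df2)) matrix.
    start = df1['geno_start'].to_numpy()[:, None]
    end = df1['geno_end'].to_numpy()[:, None]
    left = df2['most_left_coordinate_genome'].to_numpy()[None, :]
    right = df2['most_right_coordinate_genome'].to_numpy()[None, :]

    start_inside = start >= left
    end_inside = end <= right
    start_near = np.abs(start - left) < tol
    end_near = np.abs(end - right) < tol

    # The same three cases as the if/elif chain, combined with OR.
    match = (start_inside & end_inside) | (start_near & end_inside) | (start_inside & end_near)

    # Column index of the last matching df2 row for each df1 row
    # (assumes df2 is not empty; the value is ignored where has_match is False).
    has_match = match.any(axis=1)
    last = match.shape[1] - 1 - np.argmax(match[:, ::-1], axis=1)

    ids = df2['protein_ID'].to_numpy(dtype=object)
    protein_ids = np.full(len(df1), None, dtype=object)
    protein_ids[has_match] = ids[last[has_match]]

    out = df1.copy()
    out['protein_ID'] = protein_ids
    return out


d1 = {'geno_start': [60, 1120, 1660], 'geno_end': [90, 1150, 1690]}
d2 = {'most_left_coordinate_genome': [56, 1120, 1655],
      'most_right_coordinate_genome': [88, 1150, 1690],
      'protein_ID': ['XYZ_1', 'XYZ_2', 'XYZ_3']}
result = match_ranges_vectorized(pd.DataFrame(d1), pd.DataFrame(d2))
print(result['protein_ID'].tolist())
```

The intermediate boolean matrices hold len(df1) * len(df2) entries, so for very large inputs it may be necessary to process df1 in chunks to keep memory use bounded.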