0

This is my first question on forum. Thanks for any help!

I wrote nested for loop based on df.iterrows () (sic.) and it takes a huuuuuge amount of time to perform. I need to assing value from one dataframe into another one by checking all the cells in described condition. Can you just help me to make it effective? (multiprocessing, apply method, vectorization or anything else?) Would be so grateful! :)

Sample data:

import pandas as pd
import numpy as np

d1 = {'geno_start' : [60, 1120, 1660], 'geno_end' : [90, 1150, 1690], 'original_subseq' : ['AAATGCCTGAACCTTGGAATTGGA',  'AAATGCCTGAACCTTGGAATTGGA', 'AAATGCCTGAACCTTGGAATTGGA']}

d2 = {'most_left_coordinate_genome' : [56, 1120, 1655], 'most_right_coordinate_genome' : [88, 1150, 1690], 'protein_ID' : ['XYZ_1', 'XYZ_2', 'XYZ_3']}

df_1 = pd.DataFrame(data=d1)
df_2 = pd.DataFrame(data=d2)

df_1['protein_ID'] = np.nan


def match_ranges(df1: pd.DataFrame, df2: pd.DataFrame):

for index, row_2 in df2.iterrows():
    for index_1, row_1 in df1.iterrows():
        if (row_1['geno_start'] >= row_2['most_left_coordinate_genome']) & (row_1['geno_end'] <= row_2['most_right_coordinate_genome']):
            df1['protein_ID'].iloc[index_1] = row_2['protein_ID']
                
        elif (abs(row_1['geno_start'] - row_2['most_left_coordinate_genome']) < 30) & (row_1['geno_end'] <= row_2['most_right_coordinate_genome']):
            df1['protein_ID'].iloc[index_1] = row_2['protein_ID']
                
        elif (row_1['geno_start'] >= row_2['most_left_coordinate_genome']) & (abs(row_1['geno_end'] - row_2['most_right_coordinate_genome']) < 30):
            df1['protein_ID'].iloc[index_1] = row_2['protein_ID']

match_ranges(df_1, df_2)

Desired output:

McsLk
  • 3
  • 2
  • 1
    Can you explain what you are doing in your looping? What is the logic? Also, please provide a sample input dataframe and expected output dataframe. – Scott Boston Sep 10 '20 at 15:20
  • I have two dataframes: genomics and protein data with some genomic ranges. I'm trying to match ranges of genomic responsible for protein sequence. At the time of matching the range, I want to assign the name of a specific protein to the second of the first analyzed dataframe. I can't use exact merge because of genomic subsequence analysis. df_1 has columns: geno_start, geno_end, subsequence df_2 has columns: most_left_coordinate_genome, most_right_coordinate_genome, protein_ID, protein_sequence I want to assign df_2 protein_ID to a matched by range row in df_1. – McsLk Sep 10 '20 at 15:34
  • Edit your question and provide some sample data and expected output so we can check the solution we are presenting. But, yes, we can make this a lot more efficient. – Scott Boston Sep 10 '20 at 15:42
  • Okey, I just edited my question. :) – McsLk Sep 10 '20 at 16:07
  • Question? How big are your actual dataframes, number of rows? An estimate is fine? – Scott Boston Sep 10 '20 at 16:52
  • 1
    df_1 has almost 100k of rows, df_2 is almost 6k... but I have few hundred dataframes to process so I'm looking for much efficient way – McsLk Sep 10 '20 at 17:15
  • https://stackoverflow.com/a/43594038/6361531 May not work, but in the comments there are a couple of other solutions. – Scott Boston Sep 10 '20 at 17:41

1 Answers1

0

Here is a way that goes from 2 for-loops to 1. I re-named a couple columns to cut line width.

First, create the data frames:

import pandas as pd

d1 = {'geno_start' : [60, 1120, 1660], 
      'geno_end' : [90, 1150, 1690], 
      'original_subseq' : ['AAATGCCTGAACCTTGGAATTGGA',  
                           'AAATGCCTGAACCTTGGAATTGGA', 
                           'AAATGCCTGAACCTTGGAATTGGA'],}
d2 = {'left' : [56, 1120, 1655], 
      'right' : [88, 1150, 1690], 
      'protein_ID' : ['XYZ_1', 'XYZ_2', 'XYZ_3']}
df_1 = pd.DataFrame(data=d1)
df_1['protein_ID'] = '?'
df_1['rule'] = '?'
df_2 = pd.DataFrame(data=d2)

Second, populate the protein_ID column in the first data frame (i.e., with genome start, genome end):

for g in df_1.itertuples():
    
    # Rule A: left most <= geno start < geno end <= right-most
    # LM-----------------------RM   left- and right-most
    #     GS-----------GE           genome start, end
    if ((df_2['left'] <= g.geno_start) & (g.geno_end <= df_2['right'])).any():
        mask = (df_2['left'] <= g.geno_start) & (g.geno_end <= df_2['right'])
        df_1.at[g.Index, 'protein_ID'] = df_2.loc[mask, 'protein_ID'].values[0]
        df_1.at[g.Index, 'rule'] = 'Rule A'
        
        
    # Rule B: geno start before left-most
    #      LM-----------------RM
    #  GS-----------------GE
    elif ((df_2['left'] - g.geno_start < 30) & (g.geno_end <= df_2['right'])).any():
        mask = (df_2['left'] - g.geno_start < 30) & (g.geno_end <= df_2['right'])
        df_1.at[g.Index, 'protein_ID'] = df_2.loc[mask, 'protein_ID'].values[0]
        df_1.at[g.Index, 'rule'] = 'Rule B'

    # Rule C: geno end after right-most 
    #      LM-----------------RM
    #          GS-----------------GE

    elif ((df_2['left'] <= g.geno_start) & (g.geno_end - df_2['right'] < 30)).any():
        mask = (df_2['left'] <= g.geno_start) & (g.geno_end - df_2['right'] < 30)
        df_1.at[g.Index, 'protein_ID'] = df_2.loc[mask, 'protein_ID'].values[0]
        df_1.at[g.Index, 'rule'] = 'Rule C'
        
    else:
        pass

print(df_1)
   geno_start  geno_end           original_subseq protein_ID    rule
0          60        90  AAATGCCTGAACCTTGGAATTGGA      XYZ_1  Rule C
1        1120      1150  AAATGCCTGAACCTTGGAATTGGA      XYZ_2  Rule A
2        1660      1690  AAATGCCTGAACCTTGGAATTGGA      XYZ_3  Rule A
jsmart
  • 2,921
  • 1
  • 6
  • 13
  • Timing note: this approach processes 100k row (genome) vs 10k row (left/right-most) data frames in 35 seconds (on a 3-year-old desktop computer). It's test data, but gives order-of-magnitude results. – jsmart Sep 10 '20 at 19:57