
Have

import pandas as pd
afltv = pd.DataFrame({'FICO': [0, 0, 700, 700],
                      'LTV': [0, 70, 0, 70],
                      'Adj': [10, 11, 12, 13]})
gfltv = pd.DataFrame({'FICO': [0, 0, 700, 700],
                      'LTV': [0, 70, 0, 70],
                      'Adj': [1, 2, 3, 4]})
df = pd.DataFrame({'Investor': ['a','a','e','f'],
                   'FICO': [600, 699, 700, 701],
                   'LTV': [69, 70, 71, 90]})
df2 = pd.DataFrame({'Investor': ['a','a','e','f'],
                    'FICO': [600, 699, 700, 701],
                    'LTV': [69, 70, 71, 90]})
dflist = [df, df2]

Want

I want to append a column of Adj values to each dataframe in dflist based on the following logic:

  • The Adj values are found by matching the maximum less-than-or-equal-to ("max le") value in the columns [FICO, LTV] against the lookup dataframes (afltv and gfltv); a minimal illustration follows this list.
  • The lookup dataframes are chosen conditionally: if the Investor column of an input row is 'a', then afltv is used; otherwise gfltv is used.
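
For example, a minimal sketch of the "max le" rule for a single FICO value, using afltv's FICO breakpoints (np.searchsorted is used here purely for illustration):

import numpy as np

breakpoints = np.array([0, 700])   # afltv's FICO breakpoints
# "max le": the greatest breakpoint that is <= 699
idx = np.searchsorted(breakpoints, 699, side='right') - 1
print(breakpoints[idx])            # prints 0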

Desired output:

print(df)  # the same for df2

  Investor  FICO  LTV  value
0        a   600   69   10.0
1        a   699   70   11.0
2        e   700   71    4.0
3        f   701   90    4.0

Previous Attempt

I have an implementation that takes almost 3 hours to complete on 400K rows.

  1. Based on the investor column, set the table variable. For example, if df.Investor == 'a' then table = afltv

  2. Convert df.FICO and df.LTV (same for df2) to the nearest value in table.FICO and table.LTV respectively, without going over. For example, if df.FICO = 699 and the values in table.FICO are 0 and 700, the conversion should result in 0.

  3. Store the results from step 2 in the variables cscore and lscore (the process in step 2 is repeated for each variable)

  4. Use .loc with the variables from step 3 to return a scalar value from the table variable set in step 1

def find_value(row):
    # Based on df.Investor (passed as row.Investor), pick the lookup table;
    # those tables contain the desired result values in the 'Adj' column.
    if row['Investor'] == 'a':
        table = afltv.copy()
    else:
        table = gfltv.copy()

    # Convert FICO (as described in step 2) and store the result in cscore.
    table.drop(table[table.FICO > row['FICO']].index, inplace=True)
    table.reset_index(drop=True, inplace=True)
    cscore = table.loc[(table['FICO'] - row['FICO']).abs().argsort(), 'FICO'].values[0]

    # Convert LTV in the same way.
    table.drop(table[table.LTV > row['LTV']].index, inplace=True)
    table.reset_index(drop=True, inplace=True)
    lscore = table.loc[(table['LTV'] - row['LTV']).abs().argsort(), 'LTV'].values[0]

    # Use .loc with the converted values to return a scalar value from table.Adj.
    adj = table.loc[(table['LTV'] == lscore) & (table['FICO'] == cscore), 'Adj'].values

    return adj

The desired output is produced this way:

for i in dflist:
    i['value'] = i.apply(find_value, axis=1).astype(float)

I also have multiple functions that work in a similar way: they take a row and return a scalar value. I've tried passing the column Series into the functions instead to speed things up, but some of my comparisons return boolean Series rather than single booleans, so that doesn't work as written; a minimal illustration follows.
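
Here is a minimal sketch (with hypothetical data) of that obstacle: a comparison on a whole Series yields a boolean Series, which cannot drive a plain if statement, so the branching has to become mask-based selection instead:

import pandas as pd

s = pd.Series(['a', 'e', 'f'])
mask = s == 'a'    # a boolean Series, not a single True/False
# if mask: ...     # raises "The truth value of a Series is ambiguous"
rows_a = s[mask]   # vectorized alternative: select rows with the mask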

I am seeking any suggestions for improvement. I have a few questions about optimization:

  1. Is looping through dataframes to apply a function a bad practice? Should I combine them into one df and apply once?

  2. Is creating new df's within the function I apply and running .loc within that function inefficient?

  3. Based on question 2, would it make more sense to convert the values I use for .loc (FICO and LTV) in the function, but skip the .loc part? I could potentially do a merge outside of the function rather than .loc (a sketch of this idea follows).
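
For reference, here is a minimal sketch of that merge idea, assuming the frames defined above and handling only the Investor == 'a' rows; FICO_key and LTV_key are hypothetical helper columns:

import numpy as np

sub = df[df['Investor'] == 'a'].copy()
fico_bins = np.sort(afltv['FICO'].unique())
ltv_bins = np.sort(afltv['LTV'].unique())

# bucket each value down to the greatest breakpoint that does not exceed it
sub['FICO_key'] = fico_bins[np.searchsorted(fico_bins, sub['FICO'], side='right') - 1]
sub['LTV_key'] = ltv_bins[np.searchsorted(ltv_bins, sub['LTV'], side='right') - 1]

# a single merge replaces the per-row .loc lookups
out = sub.merge(afltv, left_on=['FICO_key', 'LTV_key'],
                right_on=['FICO', 'LTV'], how='left', suffixes=('', '_lookup'))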

    May you provide sample data in a [reproducible way](https://stackoverflow.com/questions/20109391) and the expected output? Otherwise people won't be able to test. – Bill Huang Nov 05 '20 at 15:43
  • Great point - I will put together sample data and update the post. Thanks – Conor Murphy Nov 05 '20 at 16:47
  • Added sample data and expected output, thanks for the feedback! – Conor Murphy Nov 05 '20 at 19:33
  • May you explain the logic for obtaining the `value` column IN CLEAR WORDS? E.g., a list of well-defined rules or a sequence of steps to find the value. NO VAGUE WORDS. Almost all your terminology such as "match", "closest", "the table I want to look in", etc. was not clearly defined. I'd suggest that you shouldn't expect the potential answerers to infer this logic from your lengthy code. Try to focus on defining the requirements of the output instead of describing the (unwanted version of) code itself. – Bill Huang Nov 05 '20 at 21:00
  • Updated the question to be more clear, thanks again for the feedback. The end of your comment implies that I don't want the code to function the way it currently does, but it does work properly right now; I'm getting the correct result. I'm hoping to get suggestions on where I can optimize my code so it runs more quickly. This is why I explain the logic behind what I'm doing, in case there is a more efficient way to complete the same step. – Conor Murphy Nov 06 '20 at 14:21
  • IMHO the logic is best explained in a code-independent way, especially when your current code is already in need of improvements. A concise time-saving statement is: "I'd like to find the `Adj` value by matching the 'maximum less-than-or-equal-to value' in columns [`FICO`, `LTV`] for the input dataframes against the lookup dataframes". Such a solid description plus a concrete output example are what really matter to the potential answerers. And this would definitely save time for both parties. :) – Bill Huang Nov 06 '20 at 18:49

1 Answer


Keys

  1. The key among the keys is np.searchsorted(), which addresses the "maximum le (less than or equal to)" logic perfectly. The function is also vectorized, according to the documentation.
  2. The *fltv dataframes are in fact "flattened" 2D tables of (FICO x LTV). If they are stacked back, the Adj values can be found directly using the coordinates from step 1.
  3. Use integer array indexing to retrieve the values from step 2.

The code processes 800k rows (400k rows each in df and df2) within 0.53 seconds.

Code

import pandas as pd
import numpy as np

# Use *fltv, df, df2, dflist as given

#  cond =  0    , 1
ls_fltv = [afltv, gfltv]

# construct lookup tables as 2D (FICO x LTV) arrays
ls_tb = [fltv.sort_values(["FICO", "LTV"])
             .set_index(["FICO", "LTV"])["Adj"]
             .unstack(level=-1) for fltv in ls_fltv]

def search_subdf(df, cond):
    """Search on a subset of df based on condition"""

    # get df subset
    if cond == 0:
        df_sub = df[df['Investor'] == 'a']
    elif cond == 1:
        df_sub = df[df['Investor'] != 'a']

    # get lookup table
    tb = ls_tb[cond]

    # Search FICO in the index and LTV in the columns
    FICO_where = np.searchsorted(tb.index.values, df_sub["FICO"].values, side="right") - 1
    LTV_where = np.searchsorted(tb.columns.values, df_sub["LTV"].values, side="right") - 1
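    # caveat: this assumes every FICO/LTV value is >= the smallest breakpoint
    # (0 in this data); anything smaller would give index -1 and silently wrap
    # around to the last row/column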

    # append the Adj column
    return df_sub.assign(Adj=tb.values[FICO_where, LTV_where])
    
def search(df, n_cond):
    """Search through all conditions"""
    # compute the results from cond 0 and 1, and concat vertically
    return pd.concat([search_subdf(df, i) for i in range(n_cond)])

# execute
for i, item in enumerate(dflist):
    dflist[i] = search(item, len(ls_fltv))
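
Note that search() concatenates the two matched subsets, so rows come back grouped by condition rather than in input order; since the original index is preserved, appending .sort_index() to the search() call would restore the input order if needed.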

Output

Result

for item in dflist:
    print(item)
    print()

  Investor  FICO  LTV  Adj
0        a   600   69   10
1        a   699   70   11
2        e   700   71    4
3        f   701   90    4

  Investor  FICO  LTV  Adj
0        a   600   69   10
1        a   699   70   11
2        e   700   71    4
3        f   701   90    4

The lookup tables

for tb in ls_tb:
    print(tb)
    print()

LTV   0   70
FICO        
0     10  11
700   12  13

LTV   0   70
FICO        
0      1   2
700    3   4

Notes

Here are my suggestions on optimization:

  • Use vectorized operations on numpy arrays when efficiency becomes a problem.
    • Don't reinvent the wheel. Go find the wheel if the logic does not seem to be an extremely rare one.
  • Avoid explicit iteration over pandas dataframes.
    • If a for loop or .apply is inevitable (they are similar efficiency-wise), try to work on arrays instead of dataframes.
    • Do not use .iterrows(), .itertuples() or .append().
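
As a rough way to reproduce the timing figure above, here is a minimal benchmarking sketch (assuming the frames and functions defined earlier; absolute numbers will vary by machine):

import time

big = pd.concat([df] * 100_000, ignore_index=True)  # ~400k rows
start = time.perf_counter()
_ = search(big, len(ls_fltv))
print(f"{time.perf_counter() - start:.2f} s")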