Have
import pandas as pd

afltv = pd.DataFrame({'FICO': [0, 0, 700, 700],
                      'LTV': [0, 70, 0, 70],
                      'Adj': [10, 11, 12, 13]})
gfltv = pd.DataFrame({'FICO': [0, 0, 700, 700],
                      'LTV': [0, 70, 0, 70],
                      'Adj': [1, 2, 3, 4]})
df = pd.DataFrame({'Investor': ['a', 'a', 'e', 'f'],
                   'FICO': [600, 699, 700, 701],
                   'LTV': [69, 70, 71, 90]})
df2 = pd.DataFrame({'Investor': ['a', 'a', 'e', 'f'],
                    'FICO': [600, 699, 700, 701],
                    'LTV': [69, 70, 71, 90]})
dflist = [df, df2]
Want
I want to append a column of Adj values to each dataframe in dflist based on the following logic:

- The Adj values are found by matching the maximum less-than-or-equal-to value in each of the columns [FICO, LTV] against the lookup dataframes (afltv and gfltv); a one-row illustration follows the desired output below.
- The lookup dataframes are used conditionally: if the Investor column of an input row is 'a', then afltv is used; otherwise gfltv is used.
Desired output:
print(df) # the same for df2
  Investor  FICO  LTV  value
0        a   600   69   10.0
1        a   699   70   11.0
2        e   700   71    4.0
3        f   701   90    4.0
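To make the matching rule concrete, here is a minimal illustration for a single row (Investor 'a', FICO 699, LTV 70, which should give 11 per the table above); each column is snapped down independently to the largest lookup value that does not exceed it:

fico_bucket = afltv.loc[afltv['FICO'] <= 699, 'FICO'].max()  # -> 0
ltv_bucket = afltv.loc[afltv['LTV'] <= 70, 'LTV'].max()  # -> 70
adj = afltv.loc[(afltv['FICO'] == fico_bucket) &
                (afltv['LTV'] == ltv_bucket), 'Adj'].iloc[0]  # -> 11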
Previous Attempt
I have an implementation which takes almost 3 hours to complete for 400K rows.
1. Based on the Investor column, set the table variable. For example, if df.Investor == 'a', then table = afltv.
2. Convert df.FICO and df.LTV (same for df2) to the nearest values in table.FICO and table.LTV respectively, without going over. For example, if df.FICO is 699 and the values in table.FICO are 0 and 700, the conversion should result in 0 (see the sketch after this list).
3. Store the results from step 2 in the variables cscore and lscore (the same process described in step 2 is done for each variable).
4. Use .loc with the variables from step 3 to return a scalar value from the table variable set in step 1.
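For reference, step 2 is a "round down to the grid" conversion; here is a vectorized sketch of that step alone using numpy.searchsorted, assuming the lookup values are sorted and no input falls below the smallest grid value (0 here):

import numpy as np

fico_grid = np.sort(afltv['FICO'].unique())  # [0, 700]
idx = np.searchsorted(fico_grid, df['FICO'], side='right') - 1
cscore = fico_grid[idx]  # 600 -> 0, 699 -> 0, 700 -> 700, 701 -> 700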
def find_value(row):
    ### Based on df.Investor (passed as row.Investor), set 'table' to be one
    ### of the lookup df's established above - those df's contain the desired
    ### result values in the 'Adj' column
    if row['Investor'] == 'a':
        table = afltv.copy()
    else:
        table = gfltv.copy()
    ### Convert FICO (described in step 2) and store in cscore
    table.drop(table[table.FICO > row['FICO']].index, inplace=True)
    table.reset_index(drop=True, inplace=True)
    cscore = table.loc[(table['FICO'] - row['FICO']).abs().argsort(), 'FICO'].values[0]
    ### Convert LTV as described in step 2 and store in lscore
    table.drop(table[table.LTV > row['LTV']].index, inplace=True)
    table.reset_index(drop=True, inplace=True)
    lscore = table.loc[(table['LTV'] - row['LTV']).abs().argsort(), 'LTV'].values[0]
    ### Use .loc and the variables set above to return a scalar value from
    ### table.Adj
    adj = table.loc[(table['LTV'] == lscore) & (table['FICO'] == cscore), 'Adj'].values
    return adj
The desired output is produced this way:

for i in dflist:
    i['value'] = i.apply(find_value, axis=1).astype(float)
I also have multiple functions that act in a similar way: they take a row and return a scalar value. I've tried passing the column series into the function instead to speed things up, but I have comparisons that return booleans, so that doesn't appear to work.
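As I understand it, the problem is that with whole columns a comparison like df['Investor'] == 'a' returns a boolean Series rather than a single True/False, so the if/else branch can't pick a table; the usual workaround seems to be splitting on a mask first:

mask = df['Investor'] == 'a'  # boolean Series, one flag per row
a_rows = df[mask]  # rows that would use afltv
other_rows = df[~mask]  # rows that would use gfltv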
I'm looking for any suggestions for improvement. I have a few questions about optimization:
1. Is looping through dataframes to apply a function bad practice? Should I combine them into one df and apply once?
2. Is creating new df's inside the applied function and running .loc within that function inefficient?
3. Based on question 2, would it make more sense to convert the values I use for .loc (FICO and LTV) in the function, but skip the .loc part? I could potentially do a merge outside of the function rather than .loc (a sketch of this idea follows below).
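Regarding question 3, here is a minimal sketch of the merge-outside-the-function idea (the helper name add_adj is mine, and it assumes every input value is at least the smallest grid value, 0 here):

import numpy as np
import pandas as pd

def add_adj(frame, table):
    # Snap FICO and LTV down to the nearest grid value in `table`,
    # then resolve Adj with a single exact merge on the snapped keys.
    fico_grid = np.sort(table['FICO'].unique())
    ltv_grid = np.sort(table['LTV'].unique())
    keys = pd.DataFrame({
        'FICO': fico_grid[np.searchsorted(fico_grid, frame['FICO'], side='right') - 1],
        'LTV': ltv_grid[np.searchsorted(ltv_grid, frame['LTV'], side='right') - 1],
    })
    return keys.merge(table, on=['FICO', 'LTV'], how='left')['Adj'].to_numpy()

for frame in dflist:
    mask = frame['Investor'] == 'a'  # choose the lookup table per row
    frame.loc[mask, 'value'] = add_adj(frame[mask], afltv)
    frame.loc[~mask, 'value'] = add_adj(frame[~mask], gfltv)

This keeps every comparison vectorized, so each dataframe is processed in a handful of operations instead of one Python-level call per row.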