Python pandas - add multiple columns to dataframe where rows and columns are co-dependent

Question

I have managed to get this working with for-loops but these are very slow on the large datasets I'm working with, so am trying to find a way to do this using pandas, groupby, apply and lamda functions instead.

import pandas as pd
example_df = pd.DataFrame({"scen": [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2], 
                      "cusip": ['031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2', '031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2'], 
                      "wal": [50, 55, 60, 65, 40, 50, 60, 70, 40, 45, 50, 55, 30, 40, 50, 60],
                      "par_val": [900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000],
                      "prin_cf": [0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000],
                      "amortization": [166.67, 0, 0, 0, 208.33, 0, 0, 0, 208.33, 0, 0, 0, 277.78, 0, 0, 0],
                      "book_val": [1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0]})

for x in range(1, len(example_df['scen'])):

if (example_df['cusip'][x] == example_df['cusip'][x-1]):

# If bond matures, don't report book value
    if(example_df['par_val'][x] == 0):
        example_df['book_val'][x] = 0
    else:
        example_df['book_val'][x] = example_df['book_val'][x-1] - example_df['amortization'][x-1] - example_df['prin_cf'][x-1]


    example_df['amortization'][x] = (example_df['book_val'][x] - example_df['par_val'][x]) / example_df['wal'][x] / 12

example_df

The tricky part is that each row's book value depends on the previous row's amortization value, while each amortization value depends on the book value in the same row. Looking at the responses to a similar question here, I think there may be a way to do this using global variables that keep track of the previous values.

Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?

Something like:

def calc_bv(prin_cf, par_val, wal):
global bvalue, amort 
bvalue = bvalue - amort - prin_cf
amort = (bvalue - par_val)/wal/12
return bvalue, amort

bvalue = example_df.loc[0, 'book_val']
amort = example_df.loc[0, 'amortization']
example_df[1:][['book_val','amortization']] = example_df2[1:].apply(lambda row: calc_bv(row['prev_prin_cf'],row['par_val'],row['wal']), axis=1, result_type="expand")
example_df

score 0 · Accepted Answer · answered Nov 13 '18 at 12:11

There will, no doubt, be a smart Pandas solution based on groupby. But you can get a decent, around ~1000x, performance improvement by just rewriting your loop with numba.

# Python 3.6.0, Pandas 0.19.2

assert jpp(df).equals(original(df))

%timeit jpp(df)       # 929 µs per loop
%timeit original(df)  # 1.05 s per loop

Benchmarking code

Original:

def original(example_df):
    for x in range(1, len(example_df['scen'])):

        if (example_df['cusip'][x] == example_df['cusip'][x-1]):

        # If bond matures, don't report book value
            if(example_df['par_val'][x] == 0):
                example_df['book_val'][x] = 0
            else:
                example_df['book_val'][x] = example_df['book_val'][x-1] - example_df['amortization'][x-1] - example_df['prin_cf'][x-1]


            example_df['amortization'][x] = (example_df['book_val'][x] - example_df['par_val'][x]) / example_df['wal'][x] / 12
    return example_df

Numba:

from numba import njit

@njit
def calculator(cusip, par, book, amort, prin_cf, wal):
    n = len(par)
    for i in range(1, n):
        if cusip[i] == cusip[i-1]:
            if par[i] == 0:
                book[i] == 0
            else:
                book[i] = book[i-1] - amort[i-1] - prin_cf[i-1]
            amort[i] = (book[i] - par[i]) / wal[i] / 12
    return book, amort


def jpp(df):
    df['book_val'], df['amortization'] = calculator(pd.factorize(df['cusip'])[0], df['par_val'].values,
                                                    df['book_val'].values, df['amortization'].values,
                                                    df['prin_cf'].values, df['wal'].values)

    return df

Thanks! Although unfortunately numba is not on the list of libraries I have access to at work... any ideas on how the pandas/groupby solution might look? — M.Hope, Dec 10 '18 at 15:29
@M.Hope, Sorry, can't think of anything. Get your work to install numba :) — jpp, Dec 10 '18 at 15:35

Python pandas - add multiple columns to dataframe where rows and columns are co-dependent

1 Answers1