0

I'm looking to do vintage analysis in pandas with some csv data, and I'm wondering if I can streamline the process vs. iterating through a dataframe to create the vintage analysis. For example, I have a dataset similar to below in csv and I've read it into a dataframe default.

Origination Default Month Default Amount
1Q20 6 3000
1Q20 8 2000
1Q20 4 1000
2Q20 6 3000
2Q20 8 2500
1Q20 6 3000
2Q20 0 0
3Q20 3 1000
2Q20 4 3000
3Q20 6 4000

Now I know I can easily hit the dataframe default.groupby(['Origination']).sum() which tells me by quarter what defaulted, but doesn't give me the time dimension in months. Similarly grouping by the columns will give me defaults by month but not differtiate by quarter.

I'm guessing using something like this this question and grouping by two dimensions default.groupby(['Origination','Default Month']).sum() would allow me to split it into quarters and by months, BUT is there a way to effectively do that in a vintage format?

I created a blank dataframe with the form:

1 2 3 4 5 6 7 8
1Q20 0 0 0 0 0 0 0 0
2Q20 0 0 0 0 0 0 0 0
3Q20 0 0 0 0 0 0 0 0

I'm trying to transform that dataset by summation into a vintage table like so, note for example 1Q20 had two loans default in the 6 month so that is now $6,000.

Default 1 2 3 4 5 6 7 8
1Q20 0 0 0 1000 0 6000 0 2000
2Q20 0 3000 0 0 0 3000 0 2500
3Q20 0 0 1000 0 0 4000 0 0

Is there an easy way to do this I'm overlooking, or is it a dual group.by and then trying to assign the data by category into the dataframe?

merv
  • 67,214
  • 13
  • 180
  • 245
Benloper
  • 448
  • 4
  • 13
  • For `2Q20` , month 2 has a value 3000, how? – anky Aug 21 '21 at 18:17
  • It’s just dummy data, but it’s months from origination to default in the real dataset - so it’s not defaulting in February, but rather it defaulted 2 months after origination and was in the second quarter vintage. – Benloper Aug 21 '21 at 19:42

1 Answers1

1

You should try:

df.set_index(['Origination',' Default Month']).unstack(level=1)

Alternatively, if you have duplicates, use pivot_table:

(pd.pivot_table(df,
                index='Origination',
                columns=['Default Month'],
                values='Default Amount',
                aggfunc='sum',
                fill_value=0)
   .reindex(range(12), axis=1)  # those two lines to ensure 
   .fillna(0, downcast='infer') # all columns are present
)

output:

Default Month  0   1   2     3     4   5     6   7     8   9   10  11
Origination                                                          
1Q20            0   0   0     0  1000   0  6000   0  2000   0   0   0
2Q20            0   0   0     0  3000   0  3000   0  2500   0   0   0
3Q20            0   0   0  1000     0   0  4000   0     0   0   0   0
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Thanks - very helpful. I suppose for large datasets it might not matter, but in theory this explanation would not have a column for '1' or '5' or '7'. How would you re-index that to show those columns as blanks so that you could plot the actual vintage curve instead of jumping from month 4 to month 6. – Benloper Aug 23 '21 at 13:04
  • @Benloper You can add `.reindex(range(12), axis=1).fillna(0, downcast='infer')` to the pipeline. I updated the answer – mozway Aug 23 '21 at 13:15