Sum Pandas DF value in additional new columns according to a condition on a datetime column

Question

I've a simple Pandas Dataframe with four columns:

NRAPPOR; DSCARAT; IQTACAP; IQTAINT

2; 2006-12-31; 0.00; 60.00
2; 2007-01-31; 270.75; 150.05
2; 2007-02-28; 272.78; 148.02
2; 2007-03-31; 274.82; 145.98
2; 2007-04-30; 276.88; 143.92
... ... ... ...
5731; 2016-11-17; 1760.00; 240.00
5731; 2018-11-17; 1800.00; 200.00
5731; 2019-11-17; 1850.00; 150.00
5731; 2020-11-17; 1900.00; 100.00
5731; 2021-11-17; 1950.00; 50.00

where:
- NRAPPOR = loan ID
- NSCARAT = installment expiry date
- IQTACAP = Principal portion of installment
- IQTAINT = Interset portion of installment

For each NRAPP, I'd like to sum of IQTACAP and IQTAINT values in four distinct totalizers according to whether or not DSCARAT is less than the threshold date ('2020-03-17'
I want to sum IQTACAP in totCapOverdue if DSCADRAT is <= than the threshold date ('2020-03-17'
I want to sum IQTACAP in totCapToExpire if DSCADRAT is > than the threshold date ('2020-03-17'
I want to sum IQTAINT in totIntOverdue if DSCADRAT is <= than the threshold date ('2020-03-17'
I want to sum IQTAINT in totIntToExpire if DSCADRAT is > than the threshold date ('2020-03-17'

I'd like to obtain a New DF with 5 columns; NRAPPOR and the four totalizers

'This is my barbaric code:

'set threshold date
dataSoglia = '2020-03-17' 

totCapOverdue = 0
totIntOverdue = 0
totCapToExpire = 0
totIntToExpire = 0
rapportoPrev = 0

for index, row in df1.iterrows():

    'if NRAPPORT changes, I print the totalizer

    'I would prefer to obtain a new Dataframe with NRAPPOR and the four totalizer as new columns

    if((index[0]!=rapportoPrev) & (rapportoPrev!=0)):
        print(rapportoPrev,'\t', 'capOverdue: ', totCapOverdue, '\t', 'intOverdue: ', totIntOverdue, '\t','capToExpire: ', totCapToExpire,  '\t', 'intpToExpire: ', totIntToExpire)    


    'set totalizer to zero
    totCapOverdue = 0
    totIntOverdue = 0
    totCapToExpire = 0
    totIntToExpire = 0

if (index[1].strftime("%Y-%m-%d")  <= dataSoglia):
    totCapOverdue += row['IQTACAP']
    totIntOverdue += row['IQTAINT']
else:
    totCapToExpire += row['IQTACAP']
    totIntToExpire += row['IQTAINT']
rapportoPrev = index[0]
dataPrev=index[1]

This is my output:
2 capOverdue: 19999.999999999993 intOverdue: 4887.200000000001 capToExpire: 0 intpToExpire: 0
3 capOverdue: 123156.18000000002 intOverdue: 70519.02 capToExpire: 26843.820000000003 intpToExpire: 1528.9799999999996
4 capOverdue: 30000.0 intOverdue: 4965.180000000001 capToExpire: 0 intpToExpire: 0
5 capOverdue: 6000.000000000002 intOverdue: 167.1 capToExpire: 0 intpToExpire: 0
6 capOverdue: 18000.0 intOverdue: 2111.89 capToExpire: 0 intpToExpire: 0
7 capOverdue: 50000.00000000003 intOverdue: 8104.3 capToExpire: 0 intpToExpire: 0
8 capOverdue: 50000.00000000003 intOverdue: 15711.999999999996 capToExpire: 0 intpToExpire: 0
9 capOverdue: 70000.0 intOverdue: 18213.110000000004 capToExpire: 0 intpToExpire: 0
...

'Is there a better way to di that?
Thanks

Hi, it can be done using a combination of DataFrame methods `groupby`, `apply` and `query` (and maybe also converting dates to Timestamp type if they are string). If you can edit your question and add an example of the `df1` dataframe that one can copy and create, it will be easier to help you. — Itamar Katz, Apr 02 '20 at 21:13
Thanks Itamar. How can I add a sample of my dataframe to facilitate you? — Tonono, Apr 03 '20 at 08:27
Maybe it's easier if you just tell me the datatype of the dates column, is it pandas Timestamp? datetime? something else? I see it has a `strftime` method but just to be sure — Itamar Katz, Apr 03 '20 at 09:27
NRAPPOR int64 DSCARAT datetime64[ns] IQTACAP float64 IQTAINT float64 I'm sorry, but, as you can see, it's my first question on SOF — Tonono, Apr 03 '20 at 10:09
np, one last remark, when you write about a dataframe in SO question, it's helpful to mention which are indices and which are columns. I can guess it from your code but it's better not having to guess.. (posting answer soon) — Itamar Katz, Apr 03 '20 at 10:42
To @ItamarKatz point, please see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for future pandas questions; reproducible samples make it much easier on you and your potential helpers to understand your problem — G. Anderson, Apr 03 '20 at 14:45

score 0 · Answer 1 · answered Apr 03 '20 at 11:14

0

First I create a pandas DataFrame with subset of your data (that's what you can do in your questions to make your code reproducible):

import pandas as pd
data = [[2, '2006-12-31', 0.0, 60.0], 
        [2, '2007-01-31', 270.75, 150.05],
        [2, '2007-02-28', 272.78, 148.02],
        [2, '2007-03-31', 274.82, 145.98],
        [2, '2007-04-30', 276.88, 143.92],
        [5731, '2016-11-17', 1760.0, 240.0],
        [5731, '2018-11-17', 1800.0, 200.0],
        [5731, '2019-11-17', 1850.0, 150.0],
        [5731, '2020-11-17', 1900.0, 100.0],
        [5731, '2021-11-17', 1950.0, 50.0]]
df = pd.DataFrame(data, columns = ['NRAPPOR', 'DSCARAT', 'IQTACAP','IQTAINT'])
df = df.set_index('NRAPPOR')
#--- convert string to datetime
df['DSCARAT'] = df['DSCARAT'].apply(lambda ts: pd.Timestamp(ts))

Now I use pandas groupby to split the table according to the ID, and apply a function using apply to each group: query for the relevant dates and sum:

th_date = pd.Timestamp('2007-03-01')
overdue = df.groupby('NRAPPOR').apply(lambda g: g.query('DSCARAT < @th_date').sum())
expire = df.groupby('NRAPPOR').apply(lambda g: g.query('DSCARAT >= @th_date').sum())

answered Apr 03 '20 at 11:14

Itamar Katz

9,544
5
42
74

Thanks a lot, Itamar Is it possible to obtain the totalizers in the same row, to be exported in a CSV file? – Tonono Apr 03 '20 at 13:02
And is it possible to avoid a total for DSCARAT? What does it represent? NRAPPOR 2 IQTACAP 543.53 IQTAINT 358.07 3 DSCARAT 2007-02-26 00:00:00 IQTACAP 533.6 IQTAINT 700 4 IQTACAP 406.12 IQTAINT 390.08 5 IQTACAP 238.87 IQTAINT 77.73 – Tonono Apr 03 '20 at 13:14
You can drop the total of DSCARAT as you don't need it. And as for combining the output to a single dataframe that can be exported to csv, you should get familiar with pandas and combining/joining/concatenating arrays. – Itamar Katz Apr 04 '20 at 10:06
Itamar, this is overdue output: {(2, 'IQTACAP'): 19999.999999999993, (2, 'IQTAINT'): 4887.200000000001, (3, 'IQTACAP'): 123156.18000000002, (3, 'IQTAINT'): 70519.02, (4, 'IQTACAP'): 30000.0, (4, 'IQTAINT'): 4965.18 ... And this is expire output: {(2, 'DSCARAT'): 0.0, (2, 'IQTACAP'): 0.0, (2, 'IQTAINT'): 0.0, (3, 'IQTACAP'): 26843.82, (3, 'IQTAINT'): 1528.98, (4, 'DSCARAT'): 0.0, (4, 'IQTACAP'): 0.0, (4, 'IQTAINT'): 0.0, (5, 'IQTAINT'): 0.0, (6, 'DSCARAT'): 0.0 .. (6, 'IQTACAP'): 0.0, (6, 'IQTAINT'): 0.0 ... Why DSCARAT appears? How to drop it? – Tonono Apr 06 '20 at 17:37

Sum Pandas DF value in additional new columns according to a condition on a datetime column

1 Answers1