How to sum values under GroupBy and consecutive date conditions?

Question

Given table:

ID	LINE	SITE	DATE	UNITS	TOTAL
1	X	AAA	02-May-2017	12	30
2	X	AAA	03-May-2017	10	22
3	X	AAA	04-May-2017	22	40
4	Z	AAA	20-MAY-2017	15	44
5	Z	AAA	21-May-2017	8	30
6	Z	BBB	22-May-2017	10	32
7	Z	BBB	23-May-2017	25	52
8	K	CCC	02-Jun-2017	6	22
9	K	CCC	03-Jun-2017	4	33
10	K	CCC	12-Aug-2017	11	44
11	K	CCC	13-Aug-2017	19	40
12	K	CCC	14-Aug-2017	30	40

for each row if ID,LINE ,SITE equal to previous row (day) need to calculate as below (last day) and (last 3 days ) : Note that is need to insure date are consecutive under "groupby" of ID,LINE ,SITE columns

ID	LINE	SITE	DATE	UNITS	TOTAL	Last day	Last 3 days
1	X	AAA	02-May-2017	12	30	0	0
2	X	AAA	03-May-2017	10	22	12/30	12/30
3	X	AAA	04-May-2017	22	40	10/22	(10+12)/(30+22)
4	Z	AAA	20-MAY-2017	15	44	0	0
5	Z	AAA	21-May-2017	8	30	15/44	15/44
6	Z	BBB	22-May-2017	10	32	0	0
7	Z	BBB	23-May-2017	25	52	10/32	10/32
8	K	CCC	02-Jun-2017	6	22	0	0
9	K	CCC	03-Jun-2017	4	33	6/22	6/22
10	K	CCC	12-Aug-2017	11	44	4/33	0
11	K	CCC	13-Aug-2017	19	40	11/44	(11/44)
12	K	CCC	14-Aug-2017	30	40	19/40	(11+19/44+40)

Felipe Miranda · Accepted Answer · 2021-03-22T16:46:10.563

0

In this cases i usually do a for loop with groupby:

import pandas as pd
import numpy as np

#copied your table
table = pd.read_csv('/home/fm/Desktop/stackover.csv')
table.set_index('ID', inplace = True)
table[['Last day','Last 3 days']] = np.nan

for i,r in table.groupby(['LINE' ,'SITE']):
    #First subset non sequential dates
    limits_interval = pd.to_datetime(r['DATE']).diff() != '1 days'
    #First element is a false positive, as its impossible to calculate past days from first day
    limits_interval.iloc[0]=False

    ids_subset = r.index[limits_interval].to_list()
    ids_subset.append(r.index[-1]+1) #to consider all values
    id_start = 0

    for id_end in ids_subset:    
        r_sub = r.loc[id_start:id_end-1, :].copy()
        id_start = id_end 

        #move all values one day off, if the database is as in your example (1 line per day) wont have problems
        r_shifted = r_sub.shift(1)

        r_sub['Last day']=r_shifted['UNITS']/r_shifted['TOTAL']

        aux_units_cumsum = r_shifted['UNITS'].cumsum()
        aux_total_cumsum = r_shifted['TOTAL'].cumsum()

        r_sub['Last 3 days'] = aux_units_cumsum/aux_total_cumsum

        r_sub.fillna(0, inplace = True)

        table.loc[r_sub.index,:]=r_sub.copy()

You can make a function and apply in groupby, it would be cleaner: Apply function to pandas groupby. It would be more elegant. Wish I could help you, good luck

edited Mar 22 '21 at 16:46

answered Mar 22 '21 at 15:35

Felipe Miranda

56
4

Hi , thank you so much . it work good . the problem is (as you mentioned ) day are not consecutive to all groups. in the example i gave(updated) , if you look at "LINE=K, SITE=CCC etc" dates has gaps . – Senderke Mar 22 '21 at 16:06
Hi, you're welcome. OBS: -The diff() function applied to datetime objects returns '1 day' if days are consecutive. -I need to append r.index[-1]+1 to consider all lines, the +1 compensates the -1 in r_sub assignment -the .copy() method avoid pandas to return a view – Felipe Miranda Mar 22 '21 at 16:52
One more thing... how do i extend this case to 7 day or 'n' days back ? The idea is look back for 3,7, n days and do the calcs' even if there are gaps of day or two within 7 days for example. – Senderke Mar 24 '21 at 08:23
i think rolling and timedelta might solve it but i am not sure how !? – Senderke Mar 24 '21 at 12:27
limits_interval = pd.to_datetime(r['DATE']).diff() >n n is the variable with n days. But iam not sure i understood your question. – Felipe Miranda Mar 25 '21 at 10:11

How to sum values under GroupBy and consecutive date conditions?

1 Answers1