0

Given table:

ID LINE SITE DATE UNITS TOTAL
1 X AAA 02-May-2017 12 30
2 X AAA 03-May-2017 10 22
3 X AAA 04-May-2017 22 40
4 Z AAA 20-MAY-2017 15 44
5 Z AAA 21-May-2017 8 30
6 Z BBB 22-May-2017 10 32
7 Z BBB 23-May-2017 25 52
8 K CCC 02-Jun-2017 6 22
9 K CCC 03-Jun-2017 4 33
10 K CCC 12-Aug-2017 11 44
11 K CCC 13-Aug-2017 19 40
12 K CCC 14-Aug-2017 30 40

for each row if ID,LINE ,SITE equal to previous row (day) need to calculate as below (last day) and (last 3 days ) : Note that is need to insure date are consecutive under "groupby" of ID,LINE ,SITE columns

ID LINE SITE DATE UNITS TOTAL Last day Last 3 days
1 X AAA 02-May-2017 12 30 0 0
2 X AAA 03-May-2017 10 22 12/30 12/30
3 X AAA 04-May-2017 22 40 10/22 (10+12)/(30+22)
4 Z AAA 20-MAY-2017 15 44 0 0
5 Z AAA 21-May-2017 8 30 15/44 15/44
6 Z BBB 22-May-2017 10 32 0 0
7 Z BBB 23-May-2017 25 52 10/32 10/32
8 K CCC 02-Jun-2017 6 22 0 0
9 K CCC 03-Jun-2017 4 33 6/22 6/22
10 K CCC 12-Aug-2017 11 44 4/33 0
11 K CCC 13-Aug-2017 19 40 11/44 (11/44)
12 K CCC 14-Aug-2017 30 40 19/40 (11+19/44+40)
Senderke
  • 13
  • 4

1 Answers1

0

In this cases i usually do a for loop with groupby:

import pandas as pd
import numpy as np

#copied your table
table = pd.read_csv('/home/fm/Desktop/stackover.csv')
table.set_index('ID', inplace = True)
table[['Last day','Last 3 days']] = np.nan

for i,r in table.groupby(['LINE' ,'SITE']):
    #First subset non sequential dates
    limits_interval = pd.to_datetime(r['DATE']).diff() != '1 days'
    #First element is a false positive, as its impossible to calculate past days from first day
    limits_interval.iloc[0]=False

    ids_subset = r.index[limits_interval].to_list()
    ids_subset.append(r.index[-1]+1) #to consider all values
    id_start = 0

    for id_end in ids_subset:    
        r_sub = r.loc[id_start:id_end-1, :].copy()
        id_start = id_end 

        #move all values one day off, if the database is as in your example (1 line per day) wont have problems
        r_shifted = r_sub.shift(1)

        r_sub['Last day']=r_shifted['UNITS']/r_shifted['TOTAL']

        aux_units_cumsum = r_shifted['UNITS'].cumsum()
        aux_total_cumsum = r_shifted['TOTAL'].cumsum()

        r_sub['Last 3 days'] = aux_units_cumsum/aux_total_cumsum

        r_sub.fillna(0, inplace = True)

        table.loc[r_sub.index,:]=r_sub.copy()

You can make a function and apply in groupby, it would be cleaner: Apply function to pandas groupby. It would be more elegant. Wish I could help you, good luck

  • Hi , thank you so much . it work good . the problem is (as you mentioned ) day are not consecutive to all groups. in the example i gave(updated) , if you look at "LINE=K, SITE=CCC etc" dates has gaps . – Senderke Mar 22 '21 at 16:06
  • Hi, you're welcome. OBS: -The diff() function applied to datetime objects returns '1 day' if days are consecutive. -I need to append r.index[-1]+1 to consider all lines, the +1 compensates the -1 in r_sub assignment -the .copy() method avoid pandas to return a view – Felipe Miranda Mar 22 '21 at 16:52
  • One more thing... how do i extend this case to 7 day or 'n' days back ? The idea is look back for 3,7, n days and do the calcs' even if there are gaps of day or two within 7 days for example. – Senderke Mar 24 '21 at 08:23
  • i think rolling and timedelta might solve it but i am not sure how !? – Senderke Mar 24 '21 at 12:27
  • limits_interval = pd.to_datetime(r['DATE']).diff() >n n is the variable with n days. But iam not sure i understood your question. – Felipe Miranda Mar 25 '21 at 10:11