
This is a sample table in a pandas DataFrame (the actual row count is approx. 200K):

Acct    Ent    Co    Period    Actuals
11    100    Global    Jan    1000
11    100    Global    Jan    1000
11    100    Global    Feb    1500
11    100    Global    Feb    1200
11    100    Global    Mar    1600
11    100    Global    Mar    1300

The output I intend to get is a comparison between Actuals and Run Rate (grouped by Period). Run Rate means the average of the previous months. For example, if I select period 'Mar' and run the query, I should get the following output. Here 2350 in the Run Rate column is the average of the Jan and Feb totals: (2000 + 2700) / 2 = 2350.

Acct    Ent    Co    Period    Actuals  Run Rate
11    100    Global    Mar    2900      2350
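
For reference, the arithmetic behind the Mar row above as a minimal sketch (the monthly totals come straight from the sample table):

# monthly totals taken from the sample rows above
jan_total = 1000 + 1000                   # 2000
feb_total = 1500 + 1200                   # 2700
mar_total = 1600 + 1300                   # 2900 -> Actuals for Mar
run_rate = (jan_total + feb_total) / 2    # 2350 -> Run Rate for Mar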

I am able to achieve this with a 'for' loop running through each row, but it is not practical for a heavy dataset; on a huge dataset the loop takes more than 6-7 hours to run.

Is there any way we can achieve this without a 'for' loop? Looking for some direction.

The following is the code I am using, which works but, as mentioned above, is not practical.

def runrate():
    global excel_data_df
    runrate = []

    for i in range(len(excel_data_df)):
        data1 = excel_data_df.loc[i]
        sum1 = 0
        # finder() is defined elsewhere; it presumably returns the list of
        # months preceding the current row's Period, e.g. finder("Mar") -> ["Jan", "Feb"]
        list1 = finder(data1.Period)
        if data1.Period == "Jan":
            # no previous month, so the run rate is 0
            runrate.append(0)
        elif not isinstance(list1, list):
            # finder() did not return a list, so append 0
            runrate.append(sum1)
        else:
            # sum the Actuals of every earlier row that belongs to the same
            # group and to one of the preceding months
            for item in list1:
                d = i
                while d >= 0:
                    data2 = excel_data_df.loc[d]
                    if (item == data2.Period
                            and data1.Account == data2.Account
                            and data1.Country == data2.Country
                            and data1.Functions == data2.Functions):
                        sum1 = sum1 + data2.Actuals
                    d = d - 1
            # average over the number of preceding months
            sum1 = sum1 / len(list1)
            runrate.append(sum1)

    excel_data_df["runrate"] = runrate
  • This [question](https://stackoverflow.com/questions/22650833/pandas-groupby-cumulative-sum) almost answers your problem. Should be easy enough to convert to avg – noah Jan 13 '21 at 22:29
  • both the answers are helpful and working! – Siddalmia22 Jan 13 '21 at 23:05

3 Answers

1

Recreate the dataset:

import pandas as pd

data = [(11,100,'Global','Jan',1000),
        (11,100,'Global','Jan',1000),
        (11,100,'Global','Feb',1500),
        (11,100,'Global','Feb',1200),
        (11,100,'Global','Mar',1600),
        (11,100,'Global','Mar',1300)]

df = pd.DataFrame(data, columns = ['Acct','Ent','Co','Period', 'Actuals'])
df

result of df:

   Acct Ent     Co      Period  Actuals
0   11  100     Global  Jan     1000
1   11  100     Global  Jan     1000
2   11  100     Global  Feb     1500
3   11  100     Global  Feb     1200
4   11  100     Global  Mar     1600
5   11  100     Global  Mar     1300

The code:

#---------------------------------------------------#
# Group the data set by: 'Acct','Ent','Co','Period' #
#---------------------------------------------------#
#
# I assume that 'Acct','Ent','Co' represent one set of records
# and that you want to sum the Actuals by Period.
#
# Also note that I don't sort the df on Period.
# I think it is best that you create a new column in which
# you represent the months by a number (or a date).
df = df.groupby(['Acct','Ent','Co','Period'], as_index=False, sort=False).agg(Actuals=('Actuals','sum'))

#----------------------------------------------#
# Calculate the cumulative sum, for each group #
#----------------------------------------------#
#
# Yet again, I assume that 'Acct','Ent','Co' represent one group
#
df['cumsum'] = df.groupby(['Acct','Ent','Co'])['Actuals'].transform(pd.Series.cumsum)

#--------------------------------------------------------#
# Add a helper column which we can use for calculating   #
# the mean of the months seen so far                     #
#--------------------------------------------------------#

# create a month index with a constant value of 1
df['month_index'] = 1
# Calculate the cumulative sum, per group 
df['month_index'] = df.groupby(['Acct','Ent', 'Co'])['month_index'].transform(pd.Series.cumsum)
# calculate the mean by month
df['mean per month'] = df['cumsum'] / df['month_index']

#---------------------------------------------------------------#
# Shift the column 'mean per month' one place (to the next row) #
#---------------------------------------------------------------#
df['Run Rate'] = df['mean per month'].shift(1)

#-----------------------------#
# Show result, or filter more #
#-----------------------------#
df

result:

   Acct  Ent      Co Period  Actuals  cumsum  month_index  mean per month  Run Rate
0    11  100  Global    Jan     2000    2000            1     2000.000000       NaN
1    11  100  Global    Feb     2700    4700            2     2350.000000    2000.0
2    11  100  Global    Mar     2900    7600            3     2533.333333    2350.0
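
If the data also carries another grouping column, say a hypothetical Country with values such as Global and Asia, the same pipeline should extend by adding it to every groupby key and shifting within each group; a minimal sketch under that assumption:

keys = ['Acct', 'Ent', 'Co', 'Country']   # 'Country' is hypothetical here

df = df.groupby(keys + ['Period'], as_index=False, sort=False).agg(Actuals=('Actuals', 'sum'))
df['cumsum'] = df.groupby(keys)['Actuals'].transform(pd.Series.cumsum)
df['month_index'] = df.groupby(keys).cumcount() + 1              # 1, 2, 3, ... per group
df['mean per month'] = df['cumsum'] / df['month_index']
df['Run Rate'] = df.groupby(keys)['mean per month'].shift(1)     # shift inside each group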
Dieter
  • looking for some example – Siddalmia22 Jan 13 '21 at 22:35
  • give me a second – Dieter Jan 13 '21 at 22:38
  • thanks Dieter, my eyes are rolling to see same problem can be solved through different ways. cheers! mate – Siddalmia22 Jan 13 '21 at 22:53
  • @Siddalmia22 - If you've tabular data (like a pandas dataframe). Try not to use for loops. Try to vectorize everything – Dieter Jan 13 '21 at 22:59
  • Dieter - you've taken an assumption that above dataset is same set of records, what if same set of data is repeated for different countries for example, Asia, US etc? run rate based on index is not working as 1st row can be Global, another can be Asia...any suggestion? – Siddalmia22 Jan 14 '21 at 14:36
  • @Siddalmia22 - then you've to change your groupby values - Currently Acct, Ent, Co are seen as 1 group. - Do you've some experience with the groupby statement? – Dieter Jan 14 '21 at 15:29
  • @Dieter- a bit i can handle to get summarization at different level but can't figure out what to change here so that run rate column should calculate in an order- for example 1st select country global and Period Jan-Mar, then go to next Country Asia and period Jan-Mar. – Siddalmia22 Jan 14 '21 at 17:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/227346/discussion-between-dieter-and-siddalmia22). – Dieter Jan 14 '21 at 17:54
1

You can use rolling, but for this specific problem I think it is simpler and faster to just do the calculations and join the results separately as a DataFrame.

Note: For multi-year datasets, specifying Mar won't cut it, so it is better to have a Year-Month column. Furthermore, you could make this dynamic and pick the current month with the datetime library (see the sketch after the output below).

a = (df[df['Period'] == 'Mar'].groupby(['Acct', 'Ent', 'Co','Period'], sort=False)
     ['Actuals'].sum().to_frame())
b = (df[df['Period'] != 'Mar'].groupby(['Acct', 'Ent', 'Co', 'Period'], sort=False)
     ['Actuals'].mean().rename('Run Rate').to_frame()
     .groupby(level=[0,1,2]).sum())
a.join(b).reset_index()
Out[1]: 
   Acct  Ent      Co Period  Actuals  Run Rate
0    11  100  Global    Mar     2900      2350
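
Regarding the note above, a minimal sketch of a Year-Month key, assuming a hypothetical Year column sits next to the abbreviated month in Period:

import datetime as dt
import pandas as pd

# build a sortable year-month key and pick the current month dynamically
df['YearMonth'] = pd.to_datetime(df['Year'].astype(str) + '-' + df['Period'], format='%Y-%b')
current_month = pd.Timestamp(dt.date.today()).to_period('M').to_timestamp()

selected = df[df['YearMonth'] == current_month]   # rows for the selected month
prior = df[df['YearMonth'] < current_month]       # all earlier months for the run rate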
David Erickson
  • This is great David, one caveat though, run rate calculation gives here 1175 which is Average of 4 months, ideally it should be average of total Jan & Feb total (2000+2700)/2 = 2350, any thought? – Siddalmia22 Jan 13 '21 at 22:48
  • @Siddalmia22 excellent, if you get the chance let us know how much faster it made your process. – David Erickson Jan 13 '21 at 22:56
0

This method can easily be expanded for use with different years and more than three months (see the sketch after the results):

Aggregate the data:

df_agg = df.groupby(['Acct','Ent','Co','Period'],sort=False).sum()

Create a column that records the previous month's Actuals for the same Acct, Ent, and Co:

df_agg['Actuals-1'] = df_agg.groupby(['Acct','Ent','Co'],sort=False)['Actuals'].shift(1)

Create a column that records the Actuals from two months ago for the same Acct, Ent, and Co:

df_agg['Actuals-2'] = df_agg.groupby(['Acct','Ent','Co'],sort=False)['Actuals'].shift(2)

Create the Run Rate column:

df_agg['Run Rate'] = (df_agg['Actuals-1'] + df_agg['Actuals-2'])/2

Drop the two helper columns:

df_agg = df_agg.drop(['Actuals-1','Actuals-2'], axis=1)

Display the dataset:

print(df_agg)

Results:

                        Actuals  Run Rate
Acct Ent Co     Period                   
11   100 Global Jan        2000       NaN
                Feb        2700       NaN
                Mar        2900    2350.0
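
To average an arbitrary number of prior months instead of exactly two, one hedged option (assuming df_agg is built exactly as above) is to replace the two shifted helper columns with an expanding mean shifted by one row:

df_agg['Run Rate'] = (df_agg.groupby(['Acct', 'Ent', 'Co'], sort=False)['Actuals']
                            .transform(lambda s: s.expanding().mean().shift(1)))

On the sample data this gives NaN for Jan, 2000.0 for Feb, and 2350.0 for Mar.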
Life is Good