Sum dataframe column by conditional row criteria

Question

How do I summarizes a column based on a dynamic value in the existing row? Using the example below, I'd like to iterate over each row (x), calculate the sum of all Clicks where Date == x.Date_Yesterday, and add the total as a new column.

Input data:

df = pd.DataFrame({
    'Date': ['2021-09-14','2021-09-14','2021-09-14','2021-09-13','2021-09-12','2021-09-12','2021-09-11'],
    'Date_Yesterday': ['2021-09-13','2021-09-13','2021-09-13','2021-09-12','2021-09-11','2021-09-11','2021-09-10'],
    'Clicks': [100,100,100,50,10,10,1]
})


   Date           Date_Yesterday  Clicks
0  2021-09-14     2021-09-13      100
1  2021-09-14     2021-09-13      100
2  2021-09-14     2021-09-13      100
3  2021-09-13     2021-09-12      50
4  2021-09-12     2021-09-11      10
5  2021-09-12     2021-09-11      10
6  2021-09-11     2021-09-10      1

Desired Output data:

Date          Date_Yesterday   Clicks  Total_Clicks_Yesterday
2021-09-14    2021-09-13       100     50
2021-09-14    2021-09-13       100     50
2021-09-14    2021-09-13       100     50
2021-09-13    2021-09-12       50      20
2021-09-12    2021-09-11       10      1
2021-09-12    2021-09-11       10      1
2021-09-11    2021-09-10       1       N/A

Calculating the Total_Clicks_Yesterday is simple with a static value:

clicks_yesterday = df['Total_Clicks_Yesterday'] = df.loc[df['Date'] == '2021-09-13', 'Clicks'].sum()
print(clicks_yesterday)

         Date Date_Yesterday  Clicks  Total_Clicks_Yesterday
0  2021-09-14     2021-09-13     100                      50
1  2021-09-14     2021-09-13     100                      50
2  2021-09-14     2021-09-13     100                      50
3  2021-09-13     2021-09-12      50                      50
4  2021-09-12     2021-09-11      10                      50
5  2021-09-12     2021-09-11      10                      50
6  2021-09-11     2021-09-10       1                      50

but I'm not sure how to make it dynamic for each line item?

score 7 · Accepted Answer · answered Sep 14 '21 at 21:12

you can groupby the column Date and sum the Clicks to get the number of clicks each day. Then use map on the column Date_yesterday with the result of the groupby operation to align the number of clicks with the previous day

df['Total_Clicks_Yesterday'] = df['Date_Yesterday'].map(df.groupby('Date')['Clicks'].sum())
print(df)
         Date Date_Yesterday  Clicks  Total_Clicks_Yesterday
0  2021-09-14     2021-09-13     100                    50.0
1  2021-09-14     2021-09-13     100                    50.0
2  2021-09-14     2021-09-13     100                    50.0
3  2021-09-13     2021-09-12      50                    20.0
4  2021-09-12     2021-09-11      10                     1.0
5  2021-09-12     2021-09-11      10                     1.0
6  2021-09-11     2021-09-10       1                     NaN

This works great -- I used a very simplified example with only 1 dimension (`Date_Yesterday`) to group by. How would this be extended to handle multiple dimensions in the groupby? For example, if there was another column `Page Name` and I want to group by `Date_Yesterday` and `Page Name`. — David Beaudway, Sep 15 '21 at 15:21
@DavidBeaudway you can use several columns in the `groupby(['Date_Yesterday','Page_Name'])...`, but then you cant't use `map` because it workds only on a Series (1 column) and not several. you will need to use `merge`, more like the other answer here — Ben.T, Sep 15 '21 at 15:35

score 2 · Answer 2 · answered Sep 14 '21 at 21:15

in one line:

pd.merge(df, df.groupby('Date')['Clicks'].sum().rename('Total_Clicks_Yesterday'), how='left', left_on='Date_Yesterday', right_index=True)

Here I'm merging two dataframes: the original dataframe, and a second one which is the groupby (documentation).

The groupby is grouping over the current date, and summing it up:

df.groupby('Date')['Clicks'].sum().rename('Total_Clicks_Yesterday')
Out[16]: 
Date
2021-09-11      1
2021-09-12     20
2021-09-13     50
2021-09-14    300
Name: Total_Clicks_Yesterday, dtype: int64

It's the easiest way to get a series as you desired, grouping and summing over a value. Although the name of the index is "date", I'm going to merge it as if it is the date of yesterday.

Now all we need to get this small series together with the original dataframe. I recommend taking a look at Pandas Merging 101 to understand it better. Basically, I'm merging based on the left dataframe (df), where I'm comparing the 'Date_Yesterday' column of df with the index of the groupby df (which to I refer to as "data of yesterday")

The result:

pd.merge(df, df.groupby('Date')['Clicks'].sum().rename('Total_Clicks_Yesterday'), how='left', left_on='Date_Yesterday', right_index=True)
Out[17]: 
         Date Date_Yesterday  Clicks  Total_Clicks_Yesterday
0  2021-09-14     2021-09-13     100                    50.0
1  2021-09-14     2021-09-13     100                    50.0
2  2021-09-14     2021-09-13     100                    50.0
3  2021-09-13     2021-09-12      50                    20.0
4  2021-09-12     2021-09-11      10                     1.0
5  2021-09-12     2021-09-11      10                     1.0
6  2021-09-11     2021-09-10       1                     NaN

Sum dataframe column by conditional row criteria

2 Answers2