
I would like to calculate num_opens_at_campaign_send per customer_id: the number of campaigns a customer had opened before each campaign was sent.

I can't figure out the best approach to do this in pandas, so any help would be appreciated. I'm thinking of a groupby on customer_id and an apply function that compares each campaign_sent date with all other dates in that column, but I'm unsure how to get the row counts needed to compute the number of campaigns a customer has opened at the time of each campaign send.

The dataframe is as follows:

customer_id  campaign_id  campaign_sent  opened
a            1234         2021-01-01     True
b            1234         2021-01-01     True
c            1234         2021-01-01     False
a            2222         2021-02-01     True
b            2222         2021-02-01     False
c            2222         2021-02-01     True
a            3333         2021-03-01     True
b            3333         2021-03-01     False
c            3333         2021-03-01     True
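For reproducibility, the frame above can be built like this (the dtypes are assumptions; campaign_sent is left as strings here, which is enough for this question):

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id':   ['a', 'b', 'c'] * 3,
    'campaign_id':   [1234] * 3 + [2222] * 3 + [3333] * 3,
    'campaign_sent': ['2021-01-01'] * 3 + ['2021-02-01'] * 3 + ['2021-03-01'] * 3,
    'opened':        [True, True, False, True, False, True, True, False, True],
})
```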

The desired output is:

customer_id  campaign_id  campaign_sent  num_opens_at_campaign_send
a            1234         2021-01-01     0
b            1234         2021-01-01     0
c            1234         2021-01-01     0
a            2222         2021-02-01     1
b            2222         2021-02-01     1
c            2222         2021-02-01     0
a            3333         2021-03-01     2
b            3333         2021-03-01     1
c            3333         2021-03-01     1

For the first campaign, num_opens_at_campaign_send is 0 for everyone because no campaigns were sent before it.

As an example, customer_id 'b' has 1 open when campaign_id 3333 is sent, because they opened the first campaign's email (1234) but not the second's (2222).

TheDataFox

1 Answer


You can use a groupby with a cumulative sum to compute the num_opens_at_campaign_send column. Group by customer_id; num_opens_at_campaign_send for a row is then the cumulative sum of opened up to and including that row, minus the value of opened in that row (so the current campaign is excluded from its own count).

To make sure the cumulative sum is computed in date order, first sort the dataframe by the campaign_sent column.

Use:

# Sort so the cumulative sum runs in chronological order
df = df.sort_values(by='campaign_sent')

# Opens per customer up to and including each row, minus the current
# row's own value = opens before this campaign was sent
df['num_opens_at_campaign_send'] = (df.groupby('customer_id')['opened']
                                      .cumsum() - df['opened'])
df.drop(columns='opened', inplace=True)

Output:

>>> df
  customer_id  campaign_id campaign_sent  num_opens_at_campaign_send
0           a         1234    2021-01-01                           0
1           b         1234    2021-01-01                           0
2           c         1234    2021-01-01                           0
3           a         2222    2021-02-01                           1
4           b         2222    2021-02-01                           1
5           c         2222    2021-02-01                           0
6           a         3333    2021-03-01                           2
7           b         3333    2021-03-01                           1
8           c         3333    2021-03-01                           1
Amit Vikram Singh
  • You would have to make sure that the df was date sorted in ascending order on the `campaign_sent` column for this to work right? – TheDataFox Apr 21 '21 at 08:03
  • Thanks for the help. How would this be done if you wanted `num_opens_at_campaign_send` for the last n days per `customer_id`? For example the last 30 days? Some form of rolling function? Again I'm struggling to get the right method. – TheDataFox Apr 21 '21 at 13:30
  • I think resample will be needed as the data does have records for everyday – TheDataFox Apr 21 '21 at 13:58
  • This may help: [Time aware rolling in pandas](https://stackoverflow.com/a/41643179/7831421) – Amit Vikram Singh Apr 21 '21 at 21:19
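Following up on the 30-day question in the comments, a time-aware rolling sum (the technique in the linked answer) can be sketched as below. The column name `opens_last_30d` and the `closed='left'` choice, which excludes the current send from its own window, are assumptions about the desired semantics:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id':   ['a', 'b', 'c'] * 3,
    'campaign_id':   [1234] * 3 + [2222] * 3 + [3333] * 3,
    'campaign_sent': pd.to_datetime(['2021-01-01'] * 3
                                    + ['2021-02-01'] * 3
                                    + ['2021-03-01'] * 3),
    'opened':        [True, True, False, True, False, True, True, False, True],
})

# Rolling on a time offset needs a monotonic datetime index within each
# group, so sort by customer and date first.
df = df.sort_values(['customer_id', 'campaign_sent']).reset_index(drop=True)
df['opened'] = df['opened'].astype(int)  # rolling aggregations want numeric

# 30-day time-aware rolling sum per customer; closed='left' drops the
# current row from its own window, so only earlier sends are counted.
rolled = (df.set_index('campaign_sent')
            .groupby('customer_id')['opened']
            .rolling('30D', closed='left')
            .sum())

# rolled is indexed by (customer_id, campaign_sent) in the same order as
# the sorted frame, so its values map straight back onto df's rows.
df['opens_last_30d'] = rolled.fillna(0).astype(int).to_numpy()
```

Note that '30D' measures exact elapsed time, so a send on 2021-01-01 already falls outside the window of one on 2021-02-01 (31 days later); whether that edge behaviour is what you want depends on how you define "last 30 days".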