I would like to calculate by customer_id
the num_opens_at_campaign_send
. This is dependant on the number of campaigns a customer has opened prior to each campaign being sent.
I can't figure out the best approach to do this in pandas, so any help would be appreciated. I'm thinking a groupby customer_id
and an apply
function to compare each campaign_sent
date with all other dates in that column, but I'm unsure of the precise method to get the row counts to calculate the number of campaigns a customer has opened at the time of each campaign send.
The dataframe is as follows:
customer_id | campaign_id | campaign_sent | opened |
---|---|---|---|
a | 1234 | 2021-01-01 | True |
b | 1234 | 2021-01-01 | True |
c | 1234 | 2021-01-01 | False |
a | 2222 | 2021-02-01 | True |
b | 2222 | 2021-02-01 | False |
c | 2222 | 2021-02-01 | True |
a | 3333 | 2021-03-01 | True |
b | 3333 | 2021-03-01 | False |
c | 3333 | 2021-03-01 | True |
The desired output is:
customer_id | campaign_id | campaign_sent | num_opens_at_campaign_send |
---|---|---|---|
a | 1234 | 2021-01-01 | 0 |
b | 1234 | 2021-01-01 | 0 |
c | 1234 | 2021-01-01 | 0 |
a | 2222 | 2021-02-01 | 1 |
b | 2222 | 2021-02-01 | 1 |
c | 2222 | 2021-02-01 | 0 |
a | 3333 | 2021-03-01 | 2 |
b | 3333 | 2021-03-01 | 1 |
c | 3333 | 2021-03-01 | 1 |
So for the first campaign num_opens_at_campaign_send
is 0 for all because there were no campaigns prior to it.
As an example customer_id
'b' has 1 open when campaign_id
3333 is sent because they opened the first campaign (1234) but did not open the second campaign (2222) email.