
I would like to calculate num_opens_at_campaign_send per customer_id: the number of campaigns a customer had opened before each campaign was sent.

I can't figure out the best approach to do this in pandas, so any help would be appreciated. I'm thinking of a groupby on customer_id and an apply function that compares each campaign_sent date with all other dates in that column, but I'm unsure how to get the row counts needed to compute the number of campaigns a customer has opened at the time of each campaign send.

The dataframe is as follows:

customer_id  campaign_id  campaign_sent  opened
a            1234         2021-01-01     True
b            1234         2021-01-01     True
c            1234         2021-01-01     False
a            2222         2021-02-01     True
b            2222         2021-02-01     False
c            2222         2021-02-01     True
a            3333         2021-03-01     True
b            3333         2021-03-01     False
c            3333         2021-03-01     True
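For reproducibility, the frame above can be built like this (the dtypes are assumptions; campaign_sent is left as strings here, which is enough for this question):

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id':   ['a', 'b', 'c'] * 3,
    'campaign_id':   [1234] * 3 + [2222] * 3 + [3333] * 3,
    'campaign_sent': ['2021-01-01'] * 3 + ['2021-02-01'] * 3 + ['2021-03-01'] * 3,
    'opened':        [True, True, False, True, False, True, True, False, True],
})
```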

The desired output is:

customer_id  campaign_id  campaign_sent  num_opens_at_campaign_send
a            1234         2021-01-01     0
b            1234         2021-01-01     0
c            1234         2021-01-01     0
a            2222         2021-02-01     1
b            2222         2021-02-01     1
c            2222         2021-02-01     0
a            3333         2021-03-01     2
b            3333         2021-03-01     1
c            3333         2021-03-01     1

For the first campaign, num_opens_at_campaign_send is 0 for everyone because no campaigns were sent before it.

As an example, customer_id 'b' has 1 open when campaign_id 3333 is sent, because they opened the first campaign's email (1234) but not the second's (2222).

TheDataFox

1 Answer


You can use a groupby with a cumulative sum to compute the num_opens_at_campaign_send column. Group by customer_id; num_opens_at_campaign_send for a row is then the cumulative sum of opened up to and including that row, minus the value of opened in that row (so the current campaign is excluded from its own count).

To make sure the cumulative sum is computed in date order, first sort the dataframe by the campaign_sent column.

Use:

# Sort so the cumulative sum runs in chronological order
df = df.sort_values(by='campaign_sent')

# Opens per customer up to and including each row, minus the current
# row's own value = opens before this campaign was sent
df['num_opens_at_campaign_send'] = (df.groupby('customer_id')['opened']
                                      .cumsum() - df['opened'])
df.drop(columns='opened', inplace=True)

Output:

>>> df
  customer_id  campaign_id campaign_sent  num_opens_at_campaign_send
0           a         1234    2021-01-01                           0
1           b         1234    2021-01-01                           0
2           c         1234    2021-01-01                           0
3           a         2222    2021-02-01                           1
4           b         2222    2021-02-01                           1
5           c         2222    2021-02-01                           0
6           a         3333    2021-03-01                           2
7           b         3333    2021-03-01                           1
8           c         3333    2021-03-01                           1
Amit Vikram Singh
  • You would have to make sure that the df was date sorted in ascending order on the `campaign_sent` column for this to work right? – TheDataFox Apr 21 '21 at 08:03
  • Thanks for the help. How would this be done if you wanted `num_opens_at_campaign_send` for the last n days per `customer_id`? For example the last 30 days? Some form of rolling function? Again I'm struggling to get the right method. – TheDataFox Apr 21 '21 at 13:30
  • I think resample will be needed as the data does have records for everyday – TheDataFox Apr 21 '21 at 13:58
  • This may help: [Time aware rolling in pandas](https://stackoverflow.com/a/41643179/7831421) – Amit Vikram Singh Apr 21 '21 at 21:19
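Following up on the 30-day question in the comments, a time-aware rolling sum (the technique in the linked answer) can be sketched as below. The column name `opens_last_30d` and the `closed='left'` choice, which excludes the current send from its own window, are assumptions about the desired semantics:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id':   ['a', 'b', 'c'] * 3,
    'campaign_id':   [1234] * 3 + [2222] * 3 + [3333] * 3,
    'campaign_sent': pd.to_datetime(['2021-01-01'] * 3
                                    + ['2021-02-01'] * 3
                                    + ['2021-03-01'] * 3),
    'opened':        [True, True, False, True, False, True, True, False, True],
})

# Rolling on a time offset needs a monotonic datetime index within each
# group, so sort by customer and date first.
df = df.sort_values(['customer_id', 'campaign_sent']).reset_index(drop=True)
df['opened'] = df['opened'].astype(int)  # rolling aggregations want numeric

# 30-day time-aware rolling sum per customer; closed='left' drops the
# current row from its own window, so only earlier sends are counted.
rolled = (df.set_index('campaign_sent')
            .groupby('customer_id')['opened']
            .rolling('30D', closed='left')
            .sum())

# rolled is indexed by (customer_id, campaign_sent) in the same order as
# the sorted frame, so its values map straight back onto df's rows.
df['opens_last_30d'] = rolled.fillna(0).astype(int).to_numpy()
```

Note that '30D' measures exact elapsed time, so a send on 2021-01-01 already falls outside the window of one on 2021-02-01 (31 days later); whether that edge behaviour is what you want depends on how you define "last 30 days".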