
Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.

Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.

df_emails includes the date we sent a customer an email and it looks like this:

   CustID   DateSent
0       2 2018-01-20
1       2 2018-02-19
2       2 2018-03-31
3       4 2018-01-10
4       4 2018-02-26
5       5 2018-02-01
6       5 2018-02-07

df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:

   CustID   TripDate  TotalSpend
0       2 2018-02-04          25
1       2 2018-02-16         100
2       2 2018-02-22         250
3       4 2018-01-03          50
4       4 2018-02-28         100
5       4 2018-03-21         100
6       8 2018-01-07         200
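
In case it helps with testing, here is a minimal snippet that rebuilds the two toy frames above (my real frames are of course far larger and loaded elsewhere):

import pandas as pd

# toy copies of the tables above, already sorted by CustID and then by date
df_emails = pd.DataFrame({
    "CustID": [2, 2, 2, 4, 4, 5, 5],
    "DateSent": pd.to_datetime([
        "2018-01-20", "2018-02-19", "2018-03-31",
        "2018-01-10", "2018-02-26", "2018-02-01", "2018-02-07",
    ]),
})

df_trips = pd.DataFrame({
    "CustID": [2, 2, 2, 4, 4, 4, 8],
    "TripDate": pd.to_datetime([
        "2018-02-04", "2018-02-16", "2018-02-22",
        "2018-01-03", "2018-02-28", "2018-03-21", "2018-01-07",
    ]),
    "TotalSpend": [25, 100, 250, 50, 100, 100, 200],
})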

Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:

   CustID   DateSent NextDateSentOrEndOfData  TripsBetween  TotalSpendBetween
0       2 2018-01-20              2018-02-19           2.0              125.0
1       2 2018-02-19              2018-03-31           1.0              250.0
2       2 2018-03-31              2018-04-01           0.0                0.0
3       4 2018-01-10              2018-02-26           0.0                0.0
4       4 2018-02-26              2018-04-01           2.0              200.0
5       5 2018-02-01              2018-02-07           0.0                0.0
6       5 2018-02-07              2018-04-01           0.0                0.0

Though I have tried my best to do this in a Python/pandas-friendly way, the only accurate solution I have been able to implement uses .where, shifting, and looping. It looks like this:

df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1

df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')

df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))

for i in df_emails.index:
    df_emails.at[i, "TripsBetween"] = len(df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i,"DateSent"]) & (df_trips["TripDate"] < df_emails.at[i,"NextDateSentOrEndOfData"])])

for i in df_emails.index:
    df_emails.at[i, "TotalSpendBetween"] = df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i,"DateSent"]) & (df_trips["TripDate"] < df_emails.at[i,"NextDateSentOrEndOfData"])].TotalSpend.sum()

df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)

However, a %%timeit reveals that this takes 10.6 ms on just the seven rows shown above, which makes the approach infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a faster solution that would be feasible at that scale?

  • @QuangHoang 2018-04-01 is the last/most recent date of the df_trips data. Thus, there can be no trips or money spent after 2018-04-01, so at the row of the last email to each patron, NextDateSentOrEndOfData will be 2018-04-01. – bryanthales Jul 05 '19 at 16:33

2 Answers


This would be an easy case for merge_asof if I could handle max_date there, so instead I take the longer route:

max_date = pd.to_datetime('2018-04-01')

# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)

# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID').shift(-1).fillna(max_date)

# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins=list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)

# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)

# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
                  .TotalSpend.agg(['sum', 'size'])
                  .reset_index()
         )

# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)

# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)

# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left')

Output:

   CustID   DateSent NextDateSentOrEndOfData    sum  size
0       2 2018-01-20              2018-02-19  125.0   2.0
1       2 2018-02-19              2018-03-31  250.0   1.0
2       2 2018-03-31              2018-04-01    NaN   NaN
3       4 2018-01-10              2018-02-26    NaN   NaN
4       4 2018-02-26              2018-04-01  200.0   2.0
5       5 2018-02-01              2018-02-07    NaN   NaN
6       5 2018-02-07              2018-04-01    NaN   NaN
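
If you also want the exact column names and the zeros from the expected output, one possible final touch (just a sketch, repeating the merge above but keeping its result in a variable) is:

out = df_emails.reset_index().merge(new_df,
                                    on=['CustID', 'NextDateSentOrEndOfData'],
                                    how='left')

# give the aggregates the names from the question, and treat "no trips in the window" as 0
out = out.rename(columns={'size': 'TripsBetween', 'sum': 'TotalSpendBetween'})
out[['TripsBetween', 'TotalSpendBetween']] = out[['TripsBetween', 'TotalSpendBetween']].fillna(0)
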
– Quang Hoang
  • Thanks for taking the time to answer. I don't know if I am making a beginner's mistake with the implementation, but when I tried your solution myself I got `KeyError: "None of [Int64Index([8], dtype='int64', name='CustID')] are in the [index]"`. It seems to occur in the 'bin the dates' step... – bryanthales Jul 05 '19 at 17:27

Add the next date column to emails

df_emails["NextDateSent"] = df_emails.groupby("CustID").shift(-1)

Sort both frames for merge_asof, then match each trip to the most recent preceding email to build a trip lookup table.

df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID",
                          left_on="TripDate", right_on="DateSent",
                          direction="backward")

Aggregate the lookup table for the data you want.

df_lookup = (df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]]
                      .groupby(["CustID", "DateSent"])
                      .agg(["count", "sum"]))

Left join it back to the email table.

df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")

I chose to leave the NaNs as NaNs because I don't like filling in default values: you can always do that later if you prefer, but you can't easily distinguish things that existed from things that didn't if defaults are put in early.

   CustID   DateSent NextDateSent  (TotalSpend, count)  (TotalSpend, sum)
0       2 2018-01-20   2018-02-19                  2.0              125.0
1       2 2018-02-19   2018-03-31                  1.0              250.0
2       2 2018-03-31          NaT                  NaN                NaN
3       4 2018-01-10   2018-02-26                  NaN                NaN
4       4 2018-02-26          NaT                  2.0              200.0
5       5 2018-02-01   2018-02-07                  NaN                NaN
6       5 2018-02-07          NaT                  NaN                NaN
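
As discussed in the comments below, the .agg leaves df_lookup with two-level columns, which is what triggers the UserWarning on the join. A minimal sketch of one way to avoid it (the flat names here are purely illustrative, not part of the original code) is to flatten the columns before joining:

# flatten the ('TotalSpend', 'count') / ('TotalSpend', 'sum') columns to plain names
df_lookup.columns = ["TripsBetween", "TotalSpendBetween"]
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
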
– CJR
  • @CJR Thank you so much. This is working for me. One question though. I get a "UserWarning: merging between different levels can give an unintended result (1 levels on the left, 2 on the right) warnings.warn(msg, UserWarning)" after implementation. Can this be safely ignored? – bryanthales Jul 05 '19 at 18:24
  • For this, I believe so - it's occurring because the `df_lookup` dataframe has MultiIndexes as a result of the aggregation. You could explicitly pull those indexes apart into separate columns before you join it back to `df_emails` to suppress the warning (which would probably be a good practice anyway). – CJR Jul 05 '19 at 18:30
  • Thanks again for the reply. Used the `df_lookup.columns = df_lookup.columns.get_level_values(0)` before the final merge and the UserWarning was gone. – bryanthales Jul 09 '19 at 15:12