Multiple Conditions Sum Between Two Pandas Dataframes

Question

I am new to Pandas and am having a hard time finding this sum. Suppose I have the following dataframes:

df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'], 
                    'Date1': ['2019-02-01', '2019-12-15', '2019-03-01']})
df1['Date1'] = pd.to_datetime(df1['Date1'])
df1

Product1  Date1
A         2019-02-01
A         2019-12-15
B         2019-03-01

and

df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'], 
                    'Date2': ['2019-01-01', '2019-02-01', '2019-06-01', '2019-06-30', '2019-01-31', '2019-02-15'],
                    'Payment': [100, 150, 100, 500, 25, 25]})
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2

Product2  Date2        Payment
A         2019-01-01   100
A         2019-02-01   150
A         2019-06-01   100
C         2019-06-30   500
B         2019-01-31   25
B         2019-02-15   25

I would like to obtain the following result

Product1  Date1        Total
A         2019-02-01   250
A         2019-12-15   350
B         2019-03-01   50

where df1['Total'] is the sum of df2['Payment'] when df2['Product2'] = df1['Product1'] and df2['Date2'] <= df1['Date1'].

The best I have done so far is using the conditional loop as follows

sum_list = []
for col1, col2 in zip(df1['Product1'].values, df1['Date1'].values):
    cond = (df2['Product2'] == col1) & (df2['Date2'] <= col2)
    sum_list.append(df2[cond]['Payment'].sum())

df1['Total'] = pd.DataFrame(sum_list)
df1

but it is extremely slow when dealing with million rows. I believe there must be a better way.

why do you need both tables? the information is redundant. are there rows in the second table with dates or products not in the first? — Moshel, Aug 18 '19 at 10:50
Can you add timings ? Also, pandas is not so great for real big data, PySpark might be better than chasing the syntax that will save you a few seconds in total — M.F, Aug 18 '19 at 10:59
@Moshel You can see for yourself there is product C in `df2` that doesn't appear in `df1` — Anastasiya-Romanova 秀, Aug 18 '19 at 11:04
@M.F I'm new to Python so probably I'll stick with Pandas for a while. Thanks for your advice — Anastasiya-Romanova 秀, Aug 18 '19 at 11:07

score 2 · Answer 1 · answered Aug 18 '19 at 10:55

2

Using merge, query, GroupBy.sum:

mrg = df1.merge(df2, left_on='Product1', right_on='Product2')

mrg.query('Date2 <= Date1').groupby(['Product1', 'Date1'], as_index=False)['Payment'].sum()

Output

  Product1      Date1  Payment
0        A 2019-02-01      250
1        A 2019-12-15      350
2        B 2019-03-01       50

answered Aug 18 '19 at 10:55

Erfan

40,971
8
66
78

This looks promising. Let me adapt it first to my original problem. +1. – Anastasiya-Romanova 秀 Aug 18 '19 at 11:02
Hi, is there any chance you can apply this method to answer [this question of mine](https://stackoverflow.com/q/58074962/3397819)? I've tried, but it didn't work. Would you be kind enough to give it a try, please? Thanks in advance. – Anastasiya-Romanova 秀 Sep 24 '19 at 10:45

score 1 · Answer 2 · answered Aug 18 '19 at 10:51

You can use a join upfront, and then filter out the rows according to your criteria. A solution can look something like this:

import pandas as pd
df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'], 
                    'Date1': ['2019-02-01', '2019-12-15', '2019-03-01']})
df1['Date1'] = pd.to_datetime(df1['Date1'])

df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'], 
                    'Date2': ['2019-01-01', '2019-02-01', '2019-06-01', '2019-06-30', '2019-01-31', '2019-02-15'],
                    'Payment': [100, 150, 100, 500, 25, 25]})
df2['Date2'] = pd.to_datetime(df2['Date2'])

#Join according to the products in both tables
temp = pd.merge(df1, df2, how='left', left_on='Product1', right_on='Product2')
#only keep those rows that meet the date criteria
temp = temp[temp['Date2'] <= temp['Date1']]
#aggregate results according to product and date.
out = temp.groupby(['Product1', 'Date1']).agg({'Payment':'sum'})
#Optional: If you wish to get back the columns normally
out = out.reset_index()

print(out.head())
  Product1      Date1  Payment
0        A 2019-02-01      250
1        A 2019-12-15      350
2        B 2019-03-01       50

Let me adapt your answer to the original problem of mine. +1 for now. — Anastasiya-Romanova 秀, Aug 18 '19 at 11:01

score 1 · Answer 3 · answered Aug 18 '19 at 14:39

Here is an approach that mirrors that in the OP. Rather than performing a one-to-many merge, this loops through each row of df1 and performs the check and calculations. It is likely a bit faster than the original as it employs a list comprehension rather than growing a list, though slower than a merge. The upside is that it will use less memory.

df1['Payment'] = \
   [df2.loc[(df1.iloc[i, 0] == df2.Product2) & (df1.iloc[i, 1] > df2.Date2),
            'Payment'].sum()
    for i in range(df1.shape[0])]

This returns the desired result

df1
  Product1      Date1  Payment
0        A 2019-02-01      100
1        A 2019-12-15      350
2        B 2019-03-01       50

I'm still looking the faster way since Erfan's answer is quite slow in my laptop (but way much faster than my solution). Thanks for the intuitive answer. +1 — Anastasiya-Romanova 秀, Aug 18 '19 at 15:19

Multiple Conditions Sum Between Two Pandas Dataframes

3 Answers3