I am new to Pandas and am having a hard time finding this sum. Suppose I have the following dataframes:
df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'],
'Date1': ['2019-02-01', '2019-12-15', '2019-03-01']})
df1['Date1'] = pd.to_datetime(df1['Date1'])
df1
Product1 Date1
A 2019-02-01
A 2019-12-15
B 2019-03-01
and
df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'],
'Date2': ['2019-01-01', '2019-02-01', '2019-06-01', '2019-06-30', '2019-01-31', '2019-02-15'],
'Payment': [100, 150, 100, 500, 25, 25]})
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2
Product2 Date2 Payment
A 2019-01-01 100
A 2019-02-01 150
A 2019-06-01 100
C 2019-06-30 500
B 2019-01-31 25
B 2019-02-15 25
I would like to obtain the following result
Product1 Date1 Total
A 2019-02-01 250
A 2019-12-15 350
B 2019-03-01 50
where df1['Total']
is the sum of df2['Payment']
when df2['Product2'] = df1['Product1']
and df2['Date2'] <= df1['Date1']
.
The best I have done so far is using the conditional loop as follows
sum_list = []
for col1, col2 in zip(df1['Product1'].values, df1['Date1'].values):
cond = (df2['Product2'] == col1) & (df2['Date2'] <= col2)
sum_list.append(df2[cond]['Payment'].sum())
df1['Total'] = pd.DataFrame(sum_list)
df1
but it is extremely slow when dealing with million rows. I believe there must be a better way.