2

I have two dataframes, df1 and df2:

import pandas as pd
import datetime
import numpy as np
df1 = pd.DataFrame()
df1["StartDate"] = pd.date_range(start=('2016/01/01'), end = ('2017/12/31'), 
freq="6D")
df1["EndDate"] = df1["StartDate"] + datetime.timedelta(5)
df1["Value_three"] =  pd.Series(np.random.randn(len(df.Date))+2)

df1 has a start date, end date and value (named value_three). df2 has a date, value_one and value two.

df2 = pd.DataFrame()
df2["Date"] = pd.date_range(start=('2016/01/01'), end=('2017/12/31'), freq="D")
df2["Value_one"] = pd.Series(np.random.randn(len(df.Date)))
df2["Value_two"] = pd.Series(np.random.randn(len(df.Date))+1)

What am I looking to achieve is something similar to the SUMIFS function in Excel:

SUMIFS

In this case, I'd like add two columns to df1 that are values_1 and value_2 summed respectively where the "Date" column is df2 is between df1.StartDate and df1.EndDate.

Let's assume it is not possible to pre-aggregate this in Vertica/MSSQL/etc. We also cannot use resample as the start date and end date can be irregular (in this case, separated by 11 days, but the gap can be ever changing).

I've tried join, merge, melt, concat, and I am proving a failure! Hopefully someone can kindly assist.

SJEL
  • 17
  • 3
  • 2
    Welcome to SO. Please look here for how you can [edit](https://stackoverflow.com/posts/50262171/edit) your question to make it clearer: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). For example, do not use links or images; include data as text in your question. – jpp May 09 '18 at 21:18
  • 1
    Thank you, I will review now and update. Apologies, thought images would be the easiest way :) – SJEL May 09 '18 at 21:47
  • 2
    No problem, the main issue is we can't copy-paste images into an interpreter. So writing and testing some code becomes difficult. – jpp May 09 '18 at 21:48

1 Answers1

0

This may not the most efficient, but it works! I'd love to learn about the most efficient way to do it too!

Solution:

def getForDF1(row):
    row['Value_1'],row['Value_2']= \
    df2[(df2['Date']>=row['StartDate']) & (df2['Date']<=row['EndDate'])].sum()[['Value_1','Value_2']]
    return row 

df1=df1.apply(getForDF1,axis=1)

P.S. Your code does not seem to work as timedelta cannot work for the whole series.

df1["EndDate"] = df1["StartDate"] + datetime.timedelta(5)

To make the illustration easier, I changed the values of your df:

Initial setup:

import pandas as pd

StartDate=['2016-01-01','2016-01-13','2016-01-25','2016-02-06','2016-02-18']
EndDate=['2016-01-12','2016-01-24','2016-01-05','2016-02-17','2016-02-29']
value_3=[1,2,3,4,5]

Date=['2016-01-01','2016-01-02','2016-02-10','2016-02-11','2016-02-18']
value_1=[3,4,5,6,7]
value_2=[0,1,3,5,7]

df1=pd.DataFrame({'StartDate':StartDate,'EndDate':EndDate,'Value_3':value_3})
df2=pd.DataFrame({'Date':Date,'Value_1':value_1,'Value_2':value_2})

df1['EndDate']=pd.to_datetime(df1['EndDate'])
df1['StartDate']=pd.to_datetime(df1['StartDate'])
df2['Date']=pd.to_datetime(df2['Date'])
Matthew
  • 369
  • 3
  • 17