I have two dataframes that need a complex join operation going beyond panda's merge capability.
Dataframe 1 is a set of account records, where there are multiple GUIDs mapping to individual RIDS. However, they are guaranteed to have nonoverlapping start and end dates.
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=4, freq='5D'),
'EndDate': pd.date_range('2010-01-04', periods=4, freq='5D'),
'rid': ['A', 'A', 'A', 'B'],
'GUID': ['1', '2', '3', '4']})
>>> df1
StartDate EndDate rid GUID
0 2010-01-01 2010-01-04 A 1
1 2010-01-06 2010-01-09 A 2
2 2010-01-11 2010-01-14 A 3
3 2010-01-01 2010-01-19 B 4
Dataframe 2 is a set of transactions by date and RID.
rid_list = [random.choice(("A", "B")) for x in range(50)]
df2 = pd.DataFrame(dict(values=np.random.randn(
50), date_time=pd.date_range('2010-01-01', periods=50, freq='D'), rid=rid_list))
>>> df2.head()
values date_time rid
0 -0.214056 2010-01-01 A
1 0.168259 2010-01-02 A
2 -1.214433 2010-01-03 B
3 0.314966 2010-01-04 A
4 1.953925 2010-01-05 B
5 -0.027883 2010-01-06 A
6 -0.207795 2010-01-07 B
7 0.530119 2010-01-08 A
8 -0.297716 2010-01-09 B
9 2.080151 2010-01-10 B
I need to associate the correct GUID to each one, but have been unable to do so with pandas. For example, the very first row of a correctly joined df1 and df2 will have a GUID of 1, since it is A and occurs within GUID 1's timeframe.
>>> df2.head()
values date_time rid GUID
0 -0.214056 2010-01-01 A 1
1 0.168259 2010-01-02 A 1
2 -1.214433 2010-01-03 B 4
3 0.314966 2010-01-04 A 1
4 1.953925 2010-01-05 B 4
5 -0.027883 2010-01-06 A 2
6 -0.207795 2010-01-07 B 4
7 0.530119 2010-01-08 A 2
8 -0.297716 2010-01-09 B 4
9 2.080151 2010-01-10 B 4
From this thread, Merging dataframes based on date range, I thought a np.piecewise solution would work:
df2['GUID'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date) & (
df2.date_time.values <= end_date) & (df2.rid == rid) for start_date, end_date, rid in zip(df1.StartDate.values, df1.EndDate.values, df1.rid.values)], df1.GUID.values)
Unfortunately this returns:
ValueError: with 1 condition(s), either 1 or 2 functions are expected
Any ideas how to do this? I used pandasql in the past to handle this, but encountered scalability issues around it.