2

I have two dataframes that need a complex join operation going beyond panda's merge capability.

Dataframe 1 is a set of account records, where there are multiple GUIDs mapping to individual RIDS. However, they are guaranteed to have nonoverlapping start and end dates.

df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=4, freq='5D'),
                    'EndDate': pd.date_range('2010-01-04', periods=4, freq='5D'),
                    'rid': ['A', 'A', 'A', 'B'],
                    'GUID': ['1', '2', '3', '4']})
>>> df1

        StartDate   EndDate    rid  GUID
    0   2010-01-01  2010-01-04  A   1
    1   2010-01-06  2010-01-09  A   2
    2   2010-01-11  2010-01-14  A   3
    3   2010-01-01  2010-01-19  B   4

Dataframe 2 is a set of transactions by date and RID.

rid_list = [random.choice(("A", "B")) for x in range(50)]

df2 = pd.DataFrame(dict(values=np.random.randn(
    50), date_time=pd.date_range('2010-01-01', periods=50, freq='D'), rid=rid_list))

>>> df2.head()

         values     date_time   rid 
    0   -0.214056   2010-01-01  A
    1   0.168259    2010-01-02  A
    2   -1.214433   2010-01-03  B
    3   0.314966    2010-01-04  A
    4   1.953925    2010-01-05  B
    5   -0.027883   2010-01-06  A
    6   -0.207795   2010-01-07  B
    7   0.530119    2010-01-08  A
    8   -0.297716   2010-01-09  B
    9   2.080151    2010-01-10  B

I need to associate the correct GUID to each one, but have been unable to do so with pandas. For example, the very first row of a correctly joined df1 and df2 will have a GUID of 1, since it is A and occurs within GUID 1's timeframe.

>>> df2.head()

         values     date_time   rid  GUID
    0   -0.214056   2010-01-01  A    1
    1   0.168259    2010-01-02  A    1
    2   -1.214433   2010-01-03  B    4
    3   0.314966    2010-01-04  A    1
    4   1.953925    2010-01-05  B    4
    5   -0.027883   2010-01-06  A    2
    6   -0.207795   2010-01-07  B    4
    7   0.530119    2010-01-08  A    2
    8   -0.297716   2010-01-09  B    4
    9   2.080151    2010-01-10  B    4

From this thread, Merging dataframes based on date range, I thought a np.piecewise solution would work:

df2['GUID'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date) & (
    df2.date_time.values <= end_date) & (df2.rid == rid) for start_date, end_date, rid in zip(df1.StartDate.values, df1.EndDate.values, df1.rid.values)], df1.GUID.values)

Unfortunately this returns:

ValueError: with 1 condition(s), either 1 or 2 functions are expected

Any ideas how to do this? I used pandasql in the past to handle this, but encountered scalability issues around it.

AZhao
  • 13,617
  • 7
  • 31
  • 54
  • Groupby then https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range/46526249#46526249 – BENY May 21 '19 at 22:20
  • Also is df1 very big ? – BENY May 21 '19 at 22:21

2 Answers2

3

Here is one way 1st we unnesting the df1 , after create the date-range, then we just need to merge

df1['date_time']=[pd.date_range(x,y) for x , y in zip(df1.StartDate,df1.EndDate)]
df2=df2.merge(unnesting(df1,['date_time']).drop(['StartDate','EndDate'],1),how='left')

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, 1), how='left')
BENY
  • 317,841
  • 20
  • 164
  • 234
0

WeNYoBen's answer works, but is so complex (or elegant) that we had a hard time justifying putting it into our codebase.

We ended up with a hackier, simpler solution that is likely much less memory efficient. It creates a larger merged df on rid, and then conducts the filtering.

df3 = df2.merge(df1, on='rid')
df3[(df3['date_time'] <= df3['EndDate']) &
    (df3['StartDate'] >= df3['date_time'])]
AZhao
  • 13,617
  • 7
  • 31
  • 54