
I have two dataframes, the first one:

KEY DATE               VALUE
A   01/01/2020         1
A   05/01/2020         3
B   01/02/2020         1
C   05/02/2020         1

The second dataframe:

KEY    START_DATE     END_DATE     VALUE2
A      01/01/2020     03/01/2020   3
B      01/03/2020     06/03/2020   4
C      01/02/2020     07/02/2020   5

I'd like to merge these two on KEY, with the additional condition that DATE from the first dataframe falls within the START_DATE to END_DATE window of the second dataframe.

Expected result:

KEY   DATE          START_DATE    END_DATE     VALUE     VALUE2
A     01/01/2020    01/01/2020    03/01/2020   1         3
A     05/01/2020    NaT           NaT          3         NaN
B     01/02/2020    NaT           NaT          1         NaN
C     05/02/2020    01/02/2020    07/02/2020   1         5

I tried this, but it can be very memory-consuming if a KEY has multiple entries in the second dataframe:

df = df1.merge(df2, on="KEY", how="left")
idx = (df.DATE >= df.START_DATE) & (df.DATE <= df.END_DATE)
df = df[idx] 
deltascience

1 Answer


If all three date columns (DATE, START_DATE and END_DATE) are of string type, start by converting them to datetime:

import pandas as pd

df1.DATE = pd.to_datetime(df1.DATE, dayfirst=True)
df2.START_DATE = pd.to_datetime(df2.START_DATE, dayfirst=True)
df2.END_DATE = pd.to_datetime(df2.END_DATE, dayfirst=True)
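
As a quick check of why dayfirst=True matters for these strings, the sketch below parses two of the sample dates three ways. The variable names are illustrative only, and the explicit format string is an assumption that every value follows the DD/MM/YYYY layout shown in the question.

```python
import pandas as pd

# Two of the sample dates from the question, still as strings
s = pd.Series(["05/01/2020", "01/02/2020"])

# Day-first parsing: 5 January 2020 and 1 February 2020
day_first = pd.to_datetime(s, dayfirst=True)

# Default (month-first) parsing: 1 May 2020 and 2 January 2020
month_first = pd.to_datetime(s)

# Spelling the layout out explicitly gives the same result as dayfirst=True
explicit = pd.to_datetime(s, format="%d/%m/%Y")

print(day_first.dt.month.tolist())    # [1, 2]
print(month_first.dt.month.tolist())  # [5, 1]
```

Without dayfirst=True (or an explicit format), the window comparison would run on the wrong dates without raising any error.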

Then, one possible solution is to:

  • merge both DataFrames on the KEY column, ignoring the date-window requirement for now (and reorder the columns),
  • in rows where DATE falls outside the required range, set the three columns that come from the second DataFrame (START_DATE, END_DATE, VALUE2) to NaN/NaT.

The code to do it is:

res = df1.merge(df2, on='KEY', how="left")\
    [['KEY', 'DATE', 'START_DATE', 'END_DATE', 'VALUE', 'VALUE2']]
msk = res.DATE.between(res.START_DATE, res.END_DATE)
res[['START_DATE', 'END_DATE', 'VALUE2']] = res[['START_DATE', 'END_DATE', 'VALUE2']]\
   .where(msk)
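
For reference, here is a self-contained sketch that rebuilds the sample data from the question and applies the same merge-then-mask steps. The helper names cols, win_cols and in_window are illustrative and do not appear in the question.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "KEY": ["A", "A", "B", "C"],
    "DATE": ["01/01/2020", "05/01/2020", "01/02/2020", "05/02/2020"],
    "VALUE": [1, 3, 1, 1],
})
df2 = pd.DataFrame({
    "KEY": ["A", "B", "C"],
    "START_DATE": ["01/01/2020", "01/03/2020", "01/02/2020"],
    "END_DATE": ["03/01/2020", "06/03/2020", "07/02/2020"],
    "VALUE2": [3, 4, 5],
})

# Strings -> datetime, reading the dates as day/month/year
df1["DATE"] = pd.to_datetime(df1["DATE"], dayfirst=True)
df2["START_DATE"] = pd.to_datetime(df2["START_DATE"], dayfirst=True)
df2["END_DATE"] = pd.to_datetime(df2["END_DATE"], dayfirst=True)

# Step 1: plain left merge on KEY, then reorder the columns
cols = ["KEY", "DATE", "START_DATE", "END_DATE", "VALUE", "VALUE2"]
res = df1.merge(df2, on="KEY", how="left")[cols]

# Step 2: blank out the df2 columns wherever DATE is outside the window
in_window = res["DATE"].between(res["START_DATE"], res["END_DATE"])
win_cols = ["START_DATE", "END_DATE", "VALUE2"]
res[win_cols] = res[win_cols].where(in_window)

print(res)
```

On this data the rows for A on 01/01/2020 and C on 05/02/2020 keep their window and VALUE2, while the other two rows get NaT/NaN. Note that if a KEY has more than one window in the second DataFrame, the merge still produces one row per row/window pair, so the memory concern raised in the question is not removed by this approach.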
Valdi_Bo