1

I need to perform a merge to map a new set of ids to an old set of ids. My starting data looks like this:

lst = [10001, 20001, 30001]
dt = pd.date_range(start='2016', end='2018', freq='M')
idx = pd.MultiIndex.from_product([dt,lst],names=['date','id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)

In [94]: df.head()
Out[94]:
                         0
date       id
2016-01-31 10001 -0.512371
           20001 -1.164461
           30001 -1.253232
2016-02-29 10001 -0.129874
           20001  0.711938

And I want to map id to newid using data that looks like this:

df1 = pd.DataFrame({'id': [10001, 10001, 10001, 10001],
'start_date': ['2015-11-31', '2016-02-01', '2016-05-16', '2017-02-16'],
'end_date': ['2016-01-31', '2016-05-15', '2017-02-15', '2018-04-02'],
'new_id': ['ABC123', 'XYZ789', 'HIJ456', 'LMN654']},)
df2 = pd.DataFrame({'id': [20001, 20001, 20001, 20001],
'start_date': ['2015-10-07', '2016-01-08', '2016-06-02', '2017-02-13'],
'end_date': ['2016-01-07', '2016-06-01', '2017-02-12', '2018-03-017'],
'new_id': ['CBA321', 'ZYX987', 'JIH765', 'NML345']},)
df3 = pd.DataFrame({'id': [30001, 30001, 30001, 30001],
'start_date': ['2015-07-31', '2016-02-23', '2016-06-17', '2017-05-12'],
'end_date': ['2016-02-22', '2016-06-16', '2017-05-11', '2018-01-05'],
'new_id': ['CCC333', 'XXX444', 'HHH888', 'III888']},)
df_ranges = pd.concat([df1,df2,df3])
In [95]: df_ranges.head()
Out[95]:
   index    end_date     id  new_id  start_date
0      0  2016-01-31  10001  ABC123  2015-11-31
1      1  2016-05-15  10001  XYZ789  2016-02-01
2      2  2017-02-15  10001  HIJ456  2016-05-16
3      3  2018-04-02  10001  LMN654  2017-02-16
4      0  2016-01-07  20001  CBA321  2015-10-07

Basically, my data is monthly panel data and the new data has ranges of dates for which a specific mapping from A->B is valid. So row 1 of the mapping data says that from 2016-01-31 through 2015-211-31 the id 10001 maps to ABC123.

I've previously done this in SAS/SQL with a statement like this:

SELECT a.*, b.newid FROM df as a, df_ranges as b 
WHERE a.id = b.id AND b.start_date <= a.date < b.end_date

A few notes about the data:

  1. it should be a 1:1 mapping of id to newid.
  2. the date ranges are non-overlapping

The solution here may be a good start: Merging dataframes based on date range

It is exactly what I'm looking for except that it merges only on dates, not additionally on id. I played with groupby() and this solution but didn't find a way to make it work. Another idea I had was to unstack() the mapping data (df_ranges) to match the dimensions/time frequency of df but this seems to simply re-state the existing problem.

Jesse Blocher
  • 523
  • 1
  • 4
  • 16
  • If your dataframes are too big, you can use a cartesian join and filter aferwards like this: `df.reset_index().merge(df_ranges, on = 'id').query('start_date < date < end_date')` – Scott Boston Jun 04 '18 at 16:32
  • The data is rather large, so I'd prefer to avoid a cartesian join. – Jesse Blocher Jun 04 '18 at 16:42

1 Answers1

2

Perhaps I got downvoted because this was too easy, but I couldn't find the answer anywhere so I'll just post it here: you should use the merge_asof() which provides fuzzy matching on dates. First, data need to be sorted:

df_ranges.sort_values(by=['start_date','id'],inplace=True)
df.sort_values(by=['date','id'],inplace=True)

Then, do the merge:

pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date')

Output:

In [30]: pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date').head()
Out[30]:
        date     id         0 start_date    end_date  new_id
0 2016-01-31  10001  0.120892 2015-11-30  2016-01-31  ABC123
1 2016-01-31  20001 -0.576096 2016-01-08  2016-06-01  ZYX987
2 2016-01-31  30001  0.543597 2015-07-31  2016-02-22  CCC333
3 2016-02-29  10001  0.316212 2016-02-01  2016-05-15  XYZ789
4 2016-02-29  20001 -0.625878 2016-01-08  2016-06-01  ZYX987
Jesse Blocher
  • 523
  • 1
  • 4
  • 16