6

I want to merge two datasets that are indexed by time and id. The problem is, the time is slightly different in each dataset. In one dataset, the time (Monthly) is mid-month, so the 15th of every month. In the other dataset, it is the last business day. This should still be a one-to-one match, but the dates are not exactly the same.

My approach is to shift mid-month dates to business day end-of-month dates.

Data:

dt = pd.date_range('1/1/2011','12/31/2011', freq='D')
dt = dt[dt.day == 15]
lst = [1,2,3]
idx = pd.MultiIndex.from_product([dt,lst],names=['date','id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
df.head()

output:

                      0
date       id
2011-01-15 1  -0.598584
           2  -0.484455
           3  -2.044912
2011-02-15 1  -0.017512
           2   0.852843

This is what I want (I removed the performance warning):

In[83]:df.index.levels[0] + BMonthEnd()
Out[83]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
               '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
              dtype='datetime64[ns]', freq='BM')

However, indexes are immutable, so this does not work:

In: df.index.levels[0] = df.index.levels[0] + BMonthEnd()

TypeError: 'FrozenList' does not support mutable operations.

The only solution I've got is to reset_index(), change the dates, then set_index() again:

df.reset_index(inplace=True)
df['date'] = df['date'] + BMonthEnd()
df.set_index(['date','id'], inplace=True)

This gives what I want, but is this the best way? Is there a set_level_values() function (I didn't see it in the API)?

Or maybe I'm taking the wrong approach to the merge. I could merge the dataset with keys df.index.get_level_values(0).year, df.index.get_level_values(0).month and id but this doesn't seem much better.

Jesse Blocher
  • 523
  • 1
  • 4
  • 16

3 Answers3

5

You can use set_levels in order to set multiindex levels:

df.index.set_levels(df.index.levels[0] + pd.tseries.offsets.BMonthEnd(),
                    level='date', inplace=True)


>>> df.head()
                      0
date       id          
2011-01-31 1  -1.410646
           2   0.642618
           3  -0.537930
2011-02-28 1  -0.418943
           2   0.983186
sacuL
  • 49,704
  • 8
  • 81
  • 106
  • I'm out of votes, but I cleared one out and upvoted this because I like this solution as well. – cs95 May 30 '18 at 15:52
  • Thanks, much appreciated! I liked your solution as well! – sacuL May 30 '18 at 15:54
  • 1
    This is an excellent, clean solution to the problem of re-indexing, but I marked the solution below as the answer because it solved my overall problem of merging with better performance. By skipping the re-indexing, it was faster overall. – Jesse Blocher May 30 '18 at 16:10
5

You could just build it again:

df.index = pd.MultiIndex.from_arrays(
    [
       df.index.get_level_values(0) + BMonthEnd(),
       df.index.get_level_values(1)
])

set_levels implicitly rebuilds the index under the covers. If you have more than two levels, this solution becomes unweildy, so consider using set_levels for typing brevity.

cs95
  • 379,657
  • 97
  • 704
  • 746
2

Since you want to merge anyway, you can forget about changing the index and use use pandas.merge_asof()

Data

df1

                      0
date       id          
2011-01-15 1  -0.810581
           2   1.177235
           3   0.083883
2011-02-15 1   1.217419
           2  -0.970804
           3   1.262364
2011-03-15 1  -0.026136
           2  -0.036250
           3  -1.103929
2011-04-15 1  -1.303298

And here is one with last business day of the month, df2

                      0
date       id          
2011-01-31 1  -0.277675
           2   0.086539
           3   1.441449
2011-02-28 1   1.330212
           2  -0.028398
           3  -0.114297
2011-03-31 1  -0.031264
           2  -0.787093
           3  -0.133088
2011-04-29 1   0.938732

merge

Use df1 as your left DataFrame and then choose the merge direction as forward since the last business day is always after the 15th. Optionally, you can set a tolerance. This is useful in the situation where you are missing a month in the right DataFrame and will prevent you from merging 03-31-2011 to 02-15-2011 if you are missing data for the last business day February.

import pandas as pd
pd.merge_asof(df1.reset_index(), df2.reset_index(), by='id', on='date', 
              direction='forward', tolerance=pd.Timedelta(days=20)).set_index(['date', 'id'])

Results in

                    0_x       0_y
date       id                    
2011-01-15 1  -0.810581 -0.277675
           2   1.177235  0.086539
           3   0.083883  1.441449
2011-02-15 1   1.217419  1.330212
           2  -0.970804 -0.028398
           3   1.262364 -0.114297
2011-03-15 1  -0.026136 -0.031264
           2  -0.036250 -0.787093
           3  -1.103929 -0.133088
2011-04-15 1  -1.303298  0.938732
ALollz
  • 57,915
  • 7
  • 66
  • 89