Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe

Question

Consider the following DataFrame:

                          value
item_uid   created_at          

0S0099v8iI 2015-03-25  10652.79
0F01ddgkRa 2015-03-25   1414.71
0F02BZeTr6 2015-03-20  51505.22
           2015-03-23  51837.97
           2015-03-24  51578.63
           2015-03-25       NaN
           2015-03-26       NaN
           2015-03-27  50893.42
0F02BcIzNo 2015-03-17   1230.00
           2015-03-23   1130.00
0F02F4gAMs 2015-03-25   1855.96
0F02Vwd6Ou 2015-03-19   5709.33
0F04OlAs0R 2015-03-18    321.44
0F05GInfPa 2015-03-16    664.68
0F05PQARFJ 2015-03-18   1074.31
           2015-03-26   1098.31
0F06LFhBCK 2015-03-18    211.49
0F06ryso80 2015-03-16     13.73
           2015-03-20     12.00
0F07gg7Oth 2015-03-19   2325.70

I need to sample the full dataframe between two dates start_date and end_date on every date between them, propagating the last seen value. The sampling should be done within each item_uid independently/separately.

For example, if we were to sample between 2015-03-20 and 2015-03-29 for 0F02BZeTr6, we should get:

0F02BZeTr6 2015-03-20  51505.22
           2015-03-21  51505.22
           2015-03-22  51505.22
           2015-03-23  51837.97
           2015-03-24  51578.63
           2015-03-25  51578.63
           2015-03-26  51578.63
           2015-03-27  50893.42
           2015-03-28  50893.42
           2015-03-29  50893.42

Note that I am forward filling both NaN and missing entries in the dataframe.

This other question addresses a similar problem, but only with one group (i.e. one level). This question instead asks how to do the same but within each group (item_uid) separately. While I could split the input dataframe and iterate through each of the groups (each of the item_uid), and then stitch together the result, I am wondering if there is anything more efficient.

When I do the following (see this PR):

dates         = pd.date_range(start=start_date, end=end_date)    
df.groupby(level='itemuid').apply(lambda x: x.reindex(dates, method='ffill'))

I get:

TypeError: Fill method not supported if level passed

score 7 · Accepted Answer · answered Apr 02 '15 at 02:45

You have a couple of options, the easiest IMO is to simply unstack the first level and then ffill. I think this make it much clearer about what's going on than a groupby/resample solution (I suspect it will also be faster, depending on the data):

In [11]: df1['value'].unstack(0)
Out[11]:
item_uid    0F01ddgkRa  0F02BZeTr6  0F02BcIzNo  0F02F4gAMs  0F02Vwd6Ou  0F04OlAs0R  0F05GInfPa  0F05PQARFJ  0F06LFhBCK  0F06ryso80  0F07gg7Oth  0S0099v8iI
created_at
2015-03-16         NaN         NaN         NaN         NaN         NaN         NaN      664.68         NaN         NaN       13.73         NaN         NaN
2015-03-17         NaN         NaN        1230         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-18         NaN         NaN         NaN         NaN         NaN      321.44         NaN     1074.31      211.49         NaN         NaN         NaN
2015-03-19         NaN         NaN         NaN         NaN     5709.33         NaN         NaN         NaN         NaN         NaN      2325.7         NaN
2015-03-20         NaN    51505.22         NaN         NaN         NaN         NaN         NaN         NaN         NaN       12.00         NaN         NaN
2015-03-23         NaN    51837.97        1130         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-24         NaN    51578.63         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-25     1414.71         NaN         NaN     1855.96         NaN         NaN         NaN         NaN         NaN         NaN         NaN    10652.79
2015-03-26         NaN         NaN         NaN         NaN         NaN         NaN         NaN     1098.31         NaN         NaN         NaN         NaN
2015-03-27         NaN    50893.42         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN

If you're missing some dates you have to reindex (assuming the start and end are present, otherwise you can do this manually e.g. with pd.date_range):

In [12]: df1['value'].unstack(0).asfreq('D')
Out[12]:
item_uid    0F01ddgkRa  0F02BZeTr6  0F02BcIzNo  0F02F4gAMs  0F02Vwd6Ou  0F04OlAs0R  0F05GInfPa  0F05PQARFJ  0F06LFhBCK  0F06ryso80  0F07gg7Oth  0S0099v8iI
2015-03-16         NaN         NaN         NaN         NaN         NaN         NaN      664.68         NaN         NaN       13.73         NaN         NaN
2015-03-17         NaN         NaN        1230         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-18         NaN         NaN         NaN         NaN         NaN      321.44         NaN     1074.31      211.49         NaN         NaN         NaN
2015-03-19         NaN         NaN         NaN         NaN     5709.33         NaN         NaN         NaN         NaN         NaN      2325.7         NaN
2015-03-20         NaN    51505.22         NaN         NaN         NaN         NaN         NaN         NaN         NaN       12.00         NaN         NaN
2015-03-21         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-22         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-23         NaN    51837.97        1130         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-24         NaN    51578.63         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-25     1414.71         NaN         NaN     1855.96         NaN         NaN         NaN         NaN         NaN         NaN         NaN    10652.79
2015-03-26         NaN         NaN         NaN         NaN         NaN         NaN         NaN     1098.31         NaN         NaN         NaN         NaN
2015-03-27         NaN    50893.42         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN

Note: asfreq drops the name of the index (which is most likely a bug!)

Now you can ffill:

In [13]: df1['value'].unstack(0).asfreq('D').ffill()
Out[13]:
item_uid    0F01ddgkRa  0F02BZeTr6  0F02BcIzNo  0F02F4gAMs  0F02Vwd6Ou  0F04OlAs0R  0F05GInfPa  0F05PQARFJ  0F06LFhBCK  0F06ryso80  0F07gg7Oth  0S0099v8iI
2015-03-16         NaN         NaN         NaN         NaN         NaN         NaN      664.68         NaN         NaN       13.73         NaN         NaN
2015-03-17         NaN         NaN        1230         NaN         NaN         NaN      664.68         NaN         NaN       13.73         NaN         NaN
2015-03-18         NaN         NaN        1230         NaN         NaN      321.44      664.68     1074.31      211.49       13.73         NaN         NaN
2015-03-19         NaN         NaN        1230         NaN     5709.33      321.44      664.68     1074.31      211.49       13.73      2325.7         NaN
2015-03-20         NaN    51505.22        1230         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-21         NaN    51505.22        1230         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-22         NaN    51505.22        1230         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-23         NaN    51837.97        1130         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-24         NaN    51578.63        1130         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-25     1414.71    51578.63        1130     1855.96     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7    10652.79
2015-03-26     1414.71    51578.63        1130     1855.96     5709.33      321.44      664.68     1098.31      211.49       12.00      2325.7    10652.79
2015-03-27     1414.71    50893.42        1130     1855.96     5709.33      321.44      664.68     1098.31      211.49       12.00      2325.7    10652.79

and stack it back (Note: you can dropna=False if you want to include the starting NaN):

In [14]: s = df1['value'].unstack(0).asfreq('D').ffill().stack()

Note: If you the ordering of the index is important you can switch/sort it:

In [15]: s.index = s.index.swaplevel(0, 1)

In [16]: s = s.sort_index()

In [17]: s.index.names = ['item_uid', 'created_at']  # as this is lost earlier

In [18]: s
Out[18]:
item_uid
0F01ddgkRa  2015-03-25     1414.71
            2015-03-26     1414.71
            2015-03-27     1414.71
0F02BZeTr6  2015-03-20    51505.22
            2015-03-21    51505.22
            2015-03-22    51505.22
            2015-03-23    51837.97
            2015-03-24    51578.63
            2015-03-25    51578.63
            2015-03-26    51578.63
            2015-03-27    50893.42
...
0S0099v8iI  2015-03-25    10652.79
            2015-03-26    10652.79
            2015-03-27    10652.79
Length: 100, dtype: float64

Whether this is more efficient than a groupby/resample apply solution will depend on the data. For very sparse data (with lots of starting up NaN, assuming you want to drop these) I suspect it won't be as fast. If the data is dense (or you want to keep the initial NaN) I suspect this solution should be faster.

Thanks Andy. Very helpful! And the solution makes perfect sense. As for the `asfreq('D')` did you open a GitHub issue as a bug? Let me know otherwise, I can help with that. — Amelio Vazquez-Reina, Apr 07 '15 at 13:44
Please [do](https://github.com/pydata/pandas/issues) :) I haven't checked on master whether this is the still the case. But if so it hopefully should be an easy fix! — Andy Hayden, Apr 07 '15 at 21:08

Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe

1 Answers1

Linked