5

I am working with a multi index data frame that has a date column and location_id as indices.

index_1 = ['2020-01-01', '2020-01-03', '2020-01-04']
index_2 = [100,200,300]

index = pd.MultiIndex.from_product([index_1, 
index_2], names=['Date', 'location_id'])

df = pd.DataFrame(np.random.randint(10,100,9), index)
df

                         0
Date       location_id    
2020-01-01 100          19
           200          75
           300          39
2020-01-03 100          11
           200          91
           300          80
2020-01-04 100          36
           200          56
           300          54

I want to fill in missing dates, with just one location_id and fill it with 0:

                         0
Date       location_id    
2020-01-01 100          19
           200          75
           300          39
2020-01-02 100          0
2020-01-03 100          11
           200          91
           300          80
2020-01-04 100          36
           200          56
           300          54

How can I achieve that? This is helpful but only if my data frame was not multi indexed.

Rob
  • 241
  • 1
  • 14

3 Answers3

4

you can get unique value of the Date index level, generate all dates between min and max with pd.date_range and use difference with unique value of Date to get the missing one. Then reindex df with the union of the original index and a MultiIndex.from_product made of missing date and the min of the level location_id.

#unique dates
m = df.index.unique(level=0)
# reindex
df = df.reindex(df.index.union(
                   pd.MultiIndex.from_product([pd.date_range(m.min(), m.max())
                                                .difference(pd.to_datetime(m))
                                                .strftime('%Y-%m-%d'), 
                                             [df.index.get_level_values(1).min()]])), 
                fill_value=0)
print(df)
                 0
2020-01-01 100  91
           200  49
           300  19
2020-01-02 100   0
2020-01-03 100  41
           200  25
           300  51
2020-01-04 100  44
           200  40
           300  54

instead of pd.MultiIndex.from_product, you can also use product from itertools. Same result but maybe faster.

from itertools import product
df = df.reindex(df.index.union(
                  list(product(pd.date_range(m.min(), m.max())
                                 .difference(pd.to_datetime(m))
                                 .strftime('%Y-%m-%d'),
                               [df.index.get_level_values(1).min()]))),
                fill_value=0)
Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • Let's say my date is column is by hour increments instead of day, then I need to change `('%Y-%m-%d')` to `('%Y-%m-%d-%h')`. Is that correct? – Rob Jun 24 '20 at 21:57
  • @Rob like this I would say yes, I used `'%Y-%m-%d'` because it matches your original index, so the one you used should match the format of the original index – Ben.T Jun 24 '20 at 21:59
  • 1
    @Rob and laso you need to specify the freq in date_range, like `pd.date_range(m.min(), m.max(), freq='1H)`, the day is the default freq – Ben.T Jun 24 '20 at 22:00
3

Pandas index is immutable, so you need to construct a new index. Put index level location_id to column and get unique rows and call asfreq to create rows for missing date. Assign the result to df2. Finally, use df.align to join both indices and fillna

df1 = df.reset_index(-1)
df2 = df1.loc[~df1.index.duplicated()].asfreq('D').ffill()
df_final = df.align(df2.set_index('location_id', append=True))[0].fillna(0)

Out[75]:
                           0
Date       location_id
2020-01-01 100          19.0
           200          75.0
           300          39.0
2020-01-02 100           0.0
2020-01-03 100          11.0
           200          91.0
           300          80.0
2020-01-04 100          36.0
           200          56.0
           300          54.0
Andy L.
  • 24,909
  • 4
  • 17
  • 29
  • I tried this answer. This gave me the data frame I wanted, but also duplicated the date index, so for instance I have both `2020-01-01` and `2020-01-01 00:00:00` as indices in the same level. The former has the correct values, while the latter has `NaN` under `location_id` and `0.0` under column `0`. – Rob Jun 24 '20 at 20:56
0

unstack/stack and asfreq/reindex would work:

new_df = df.unstack(fill_value=0)

new_df.index = pd.to_datetime(new_df.index)

new_df.asfreq('D').fillna(0).stack('location_id')

Output:

                           0
Date       location_id      
2020-01-01 100          78.0
           200          25.0
           300          89.0
2020-01-02 100           0.0
           200           0.0
           300           0.0
2020-01-03 100          79.0
           200          23.0
           300          11.0
2020-01-04 100          30.0
           200          79.0
           300          72.0
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • This works well, but because I am working with a big data frame, it adds computation time. Is it possible to only add one value for each missing date, as in the example in my question? – Rob Jun 17 '20 at 20:27