1

I have a DataFrame consisting of dates, other columns and a numerical value, where some value combinations in "other columns" could be missing, and I want to populate them from previous dates.

Example. Say the DataFrame is like below. You can see on 2016-01-01, we have data for (LN, A), (LN, B), (NY, A) and (NY, B) on columns (location, band).

        date  location  band  value
0 2016-01-01        LN     A   10.0
1 2016-01-01        LN     B    5.0
2 2016-01-01        NY     A    9.0
3 2016-01-01        NY     B    6.0
4 2016-01-02        LN     A   11.0
5 2016-01-02        NY     B    7.0
6 2016-01-03        NY     A   10.0

Then you notice on 2016-01-02, we only have (LN, A) and (NY, B), but (LN, B) and (NY, A) are missing. Again, on 2016-01-03, only (NY, A) is available; all other three combinations are missing.

What I want to do is to populate the missing combinations of each date from its predecessor. Say for 2016-01-02, I would like to add two more rows, "rolled over" from 2016-01-01: (LN, B, 5.0) and (NY, A, 9.0) for columns (location, band, value). Same for 2016-01-03. So as to make the whole thing like below:

        date  location  band  value
 0 2016-01-01        LN     A   10.0
 1 2016-01-01        LN     B    5.0
 2 2016-01-01        NY     A    9.0
 3 2016-01-01        NY     B    6.0
 4 2016-01-02        LN     A   11.0
 5 2016-01-02        NY     B    7.0
 6 2016-01-03        NY     A   10.0
 7 2016-01-02        LN     B    5.0
 8 2016-01-02        NY     A    9.0
 9 2016-01-03        LN     A   11.0
10 2016-01-03        LN     B    5.0
11 2016-01-03        NY     B    7.0

Note rows 7-11 are populated from rows 1, 2, 4, 7 and 5, respectively. The order is not really important as I can always sort afterwards if all the data I need is present.

Anyone to help? Thanks a lot!

Light
  • 33
  • 9

2 Answers2

2

You can use a unstack/stack method to get all missing values, followed by a forward fill:

# Use unstack/stack to add missing locations.
df = df.set_index(['date', 'location', 'band']) \
       .unstack(level=['location', 'band']) \
       .stack(level=['location', 'band'], dropna=False)

# Forward fill NaN values within ['location', 'band'] groups.
df = df.groupby(level=['location', 'band']).ffill().reset_index()

Or you can directly build a MultiIndex containing all combinations:

# Build the full MultiIndex, set the partial MultiIndex, and reindex.
levels = ['date', 'location', 'band']
full_idx = pd.MultiIndex.from_product([df[col].unique() for col in levels], names=levels)
df = df.set_index(levels).reindex(full_idx)

# Forward fill NaN values within ['location', 'band'] groups.
df = df.groupby(level=['location', 'band']).ffill().reset_index()

The resulting output for either method:

         date location band  value
0  2016-01-01       LN    A   10.0
1  2016-01-01       LN    B    5.0
2  2016-01-01       NY    A    9.0
3  2016-01-01       NY    B    6.0
4  2016-01-02       LN    A   11.0
5  2016-01-02       LN    B    5.0
6  2016-01-02       NY    A    9.0
7  2016-01-02       NY    B    7.0
8  2016-01-03       LN    A   11.0
9  2016-01-03       LN    B    5.0
10 2016-01-03       NY    A   10.0
11 2016-01-03       NY    B    7.0
root
  • 32,715
  • 6
  • 74
  • 87
  • Thanks for the answer! I actually prefer your first solution to the second one - see my reply to @Dickster whose answer is essentially the same as your second one. – Light Oct 18 '16 at 06:59
  • This question is old, but it's exactly what I'm looking for at the moment - the only difference is that I want it to be filled with NaN. How can I do that? If I remove the ffill, it does not work to reset the index (which is needed)... – Filipe May 30 '21 at 20:28
0

My solution, in summary using the product operation to get all the combinations in a multi index, then some stacking and ffill().

df =pd.DataFrame({'date': {0: '2016-01-01', 1: '2016-01-01', 2: '2016-01-01', 3: '2016-01-01', 4: '2016-01-02', 5: '2016-01-02', 6: '2016-01-03'}, 'band': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B', 6: 'A'}, 'location': {0: 'LN', 1: 'LN', 2: 'NY', 3: 'NY', 4: 'LN', 5: 'NY', 6: 'NY'}, 'value': {0: 10, 1: 5, 2: 9, 3: 6, 4: 11, 5: 7, 6: 10}})
unique_dates = df['date'].unique()
df.set_index(['date','location','band'],inplace=True)
idx = pd.MultiIndex.from_product([unique_dates,['LN','NY'],['A','B']])
df  = df.reindex(idx)
df = df.unstack(level=[2,1])

which produces:

             value                      
                 A      B       A      B
                LN     LN      NY     NY
2016-01-01 10.0000 5.0000  9.0000 6.0000
2016-01-02 11.0000    nan     nan 7.0000
2016-01-03     nan    nan 10.0000    nan

and finally:

df = df.ffill()
df = df.stack().stack()
print df


                  value
2016-01-01 LN A 10.0000
              B  5.0000
           NY A  9.0000
              B  6.0000
2016-01-02 LN A 11.0000
              B  5.0000
           NY A  9.0000
              B  7.0000
2016-01-03 LN A 11.0000
              B  5.0000
           NY A 10.0000
              B  7.0000
Dickster
  • 2,969
  • 3
  • 23
  • 29
  • Thanks a lot! Afraid I might wanna take the other answer as the solution to this problem (the unstack/stack one), because (partially of my fault of giving a bad example in my question) I might not need the full combination of dates, ['LN', 'NY'] and ['A', 'B'].Say if the input – Light Oct 18 '16 at 07:02
  • Thanks a lot! Afraid I might wanna take the other answer as the solution to this problem (the `unstack`/`stack` one), because (partially of my fault of giving a bad example in my question) I might not need the full combination of dates, `['LN', 'NY']` and `['A', 'B']`. Say if the input in my original example did not have rows 0 and 4, then I wouldn't expect row 9 in the output either, which can be achieved via the solution from @root. Thanks again for the answer anyway : ) – Light Oct 18 '16 at 07:09