4

Question: Using pandas -- how to efficiently fill-in missing dates with zero values, with monthly (e.g. last day indexed) frequency, relative to the min/max date values per group?

Edit do not assume input dates to correspond to the last day in its month. To deal with this issue, add the following line to the proposed answers below:

df.date = df.date + pd.offsets.MonthEnd(0) 

Without this fix, filled in values with freq='M' can results in NA's...!

Note: pandas version 0.24.2

Example input:

data = [{'name': 'A', 'date': '2019-01-01', 'val': 10},
{'name': 'A', 'date': '2019-04-30', 'val': 2},
{'name': 'B', 'date': '2019-02-15', 'val': 6},
{'name': 'B', 'date': '2019-05-01', 'val': 5}]

df = pd.DataFrame(data)

         date name  val
0  2019-01-01    A   10
1  2019-04-30    A    2
2  2019-02-15    B    6
3  2019-05-01    B    5

Note that the dates in the input are not necessarily the first or last day of their corresponding month.

Example desired output

         date name  val
0  2019-01-31    A   10
1  2019-02-28    A    0
2  2019-03-31    A    0
3  2019-04-30    A    2
4  2019-02-28    B    6
5  2019-03-31    B    0
6  2019-04-30    B    0
7  2019-05-31    B    5

Attempts:

The following works at the index level but fills everything with NA:

df['date'] = pd.to_datetime(df['date'])

dg = df.groupby('name').apply(lambda x: x.reindex(pd.date_range(min(x.date), max(x.date), freq='M')))

Also:

Pandas filling missing dates and values within group

The above link's answer appear not be not relative to for each group, but rather the entire dataset's min/max date values.

Quetzalcoatl
  • 2,016
  • 4
  • 26
  • 36

2 Answers2

2

I would use groupby, resample and asfreq (edit: as you updated the question on non-MonthEnd dates. I added pd.offsets.MonthEnd as you proposed)

df.date = df.date + pd.offsets.MonthEnd(0) 
(df.set_index('date').groupby('name').resample('M')
                     .asfreq(fill_value=0).drop('name',1)
                     .reset_index())

Out[550]:
  name       date  val
0    A 2019-01-31   10
1    A 2019-02-28    0
2    A 2019-03-31    0
3    A 2019-04-30    2
4    B 2019-02-28    6
5    B 2019-03-31    0
6    B 2019-04-30    0
7    B 2019-05-31    5
Andy L.
  • 24,909
  • 4
  • 17
  • 29
  • 1
    to play safe with "resample('M')", add the following line: df.date = df.date + pd.offsets.MonthEnd(0) – Quetzalcoatl Sep 26 '19 at 02:57
  • 1
    @Quetzalcoatl: ah, I see your point. You want to handle anomoly cases where some values in `df.date` is not month-end date :) – Andy L. Sep 26 '19 at 03:03
  • 1
    @Quetzalcoatl: I added `pd.offsets.MonthEnd` as you proposed to make the answer completed :) – Andy L. Sep 26 '19 at 03:11
  • @AndyL.: How would you add same range of dates for each group? I mean each group has same range (ie, 2020 Jan to 2020 Dec). How would you do that? – Roy May 09 '21 at 04:44
1

One quick fix:

df.date = pd.to_datetime(df.date)

new_df = (df.set_index('date')
            .groupby('name', as_index=False)
            .apply(lambda x: x.resample('M').interpolate())
            .reset_index(0, drop=True)
         )

s = new_df['name'].isna() 
new_df.loc[s, 'val'] = 0
new_df['name'] = new_df['name'].ffill()

Output:

           name   val
date                 
2019-01-31    A  10.0
2019-02-28    A   0.0
2019-03-31    A   0.0
2019-04-30    A   2.0
2019-02-28    B   6.0
2019-03-31    B   0.0
2019-04-30    B   0.0
2019-05-31    B   5.0
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • 1
    it seems that using "interpolate" will sometimes (on large datasets) result in the final output having all NA's -- changing your "interpolate" to "max" seems to solve this problem. – Quetzalcoatl Sep 26 '19 at 01:23
  • to play safe with "resample('M')", add the following line: df.date = df.date + pd.offsets.MonthEnd(0) – Quetzalcoatl Sep 26 '19 at 02:57