2

Suppose I have a dataframe containing certain events that happen in each month. The data only have months and years of the events and how many of that events happen every month.

df = pd.DataFrame({'month':['2018-01', '2018-02', '2018-04','2018-05','2018-06', 
                            '2018-07', '2018-10','2018-11', '2019-01', '2019-02',
                            '2019-03', '2019-05','2019-07', '2019-11', '2019-12'],
                  'counts':[10,5,6,1,2,5,7,8,9,1,10,12,8,10,4]})

df
    month  counts
0   2018-01 10
1   2018-02 5
2   2018-04 6
3   2018-05 1
4   2018-06 2
5   2018-07 5
6   2018-10 7
7   2018-11 8
8   2019-01 9
9   2019-02 1
10  2019-03 10
11  2019-05 12
12  2019-07 10
13  2019-11 10
14  2019-12 4

As you notice above, there is a time range between January 2018 to December 2019, but not all months have count values in them. For example, there is no data for March 2018 (2018-03), and there are many missing months in between them.

I want to put this missing months and fill it with zero, so basically I want to insert {'month':'2018-03', count:0} in the right order. I also want to do the same thing to all the missing months and values that are supposed to be there.

What I have done is the following.

I converted the month to the appropriate format.

df['month'] = pd.to_datetime(df['month']).dt.to_period('M')

The code above worked fine.

Then I tried to create a date range in monthly frequency, but this does not work.

idx = pd.date_range(min(df['month']), max(df['month']), freq='M)

The error says ValueError: Cannot convert Period to Timestamp unambiguously. Use to_timestamp

What do I do? Thanks.

catris25
  • 1,173
  • 3
  • 20
  • 40
  • there seems to be a similar topic, but only with dates and not months: https://stackoverflow.com/questions/19324453/add-missing-dates-to-pandas-dataframe – paloman Jul 21 '20 at 11:43

1 Answers1

1

Use period_range, then convert periods column to PeriodIndex and use DataFrame.reindex:

df['month'] = pd.to_datetime(df['month']).dt.to_period('M')
idx = pd.period_range(df['month'].min(), df['month'].max(), freq='M')
df = df.set_index('month').reindex(idx, fill_value=0)
print (df)
         counts
2018-01      10
2018-02       5
2018-03       0
2018-04       6
2018-05       1
2018-06       2
2018-07       5
2018-08       0
2018-09       0
2018-10       7
2018-11       8
2018-12       0
2019-01       9
2019-02       1
2019-03      10
2019-04       0
2019-05      12
2019-06       0
2019-07       8
2019-08       0
2019-09       0
2019-10       0
2019-11      10
2019-12       4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252