1

Suppose I have a Pandas dataframe with 'Date' column whose values have gaps like below:

>>> import pandas as pd
>>> data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
>>> df = pd.DataFrame(data, columns=['Date','$'])
>>> df
         Date    $
0  2021-01-02  1.0
1  2021-01-05  2.0
2  2021-02-05  3.0

I would like to fill the gaps in the 'Date' column from the period between Jan 01, 2021 to Feb 28, 2021 while copying (forward-filling) the values, so from some reading up on StackOverflow posts like this, I came up with this solution to transform the dataframe as shown below:

# I need to first convert values in 'Date' column to datetime64 type
>>> df['Date'] = pd.to_datetime(df['Date'])
# Then I have to set 'Date' column as the dataframe's index
>>> df = df.set_index(['Date'])
# Without doing the above two steps, the call below returns error
>>> df_new=df.asfreq(freq='D', how={'start':'2021-01-01', 'end':'2021-03-31'}, method='ffill')
>>> df_new
              $
Date
2021-01-02  1.0
2021-01-03  1.0
2021-01-04  1.0
2021-01-05  2.0
2021-01-06  2.0
2021-01-07  2.0
2021-01-08  2.0
2021-01-09  2.0
2021-01-10  2.0
...
2021-01-31  2.0
2021-02-01  2.0
2021-02-02  2.0
2021-02-03  2.0
2021-02-04  2.0
2021-02-05  3.0

But as you can see above, the dates in df_new only starts at '2021-01-02' instead of '2021-01-01' AND it ends on '2021-02-05' instead of '2021-02-28'. I hope I'm entering the input for how parameter correctly above.

Q1: What else do I need to do to make the resulting dataframe look like below:

>>> df_new
              $
Date
2021-01-01  1.0
2021-01-02  1.0
2021-01-03  1.0
2021-01-04  1.0
2021-01-05  2.0
2021-01-06  2.0
2021-01-07  2.0
2021-01-08  2.0
2021-01-09  2.0
2021-01-10  2.0
...
2021-01-31  2.0
2021-02-01  2.0
2021-02-02  2.0
2021-02-03  2.0
2021-02-04  2.0
2021-02-05  3.0
2021-02-06  3.0
...
2021-02-28  3.0

Q2: Is there any way I can accomplish this simpler (i.e. without having to set the 'Date' column as the index of the dataframe for example)

Thanks in advance for your suggestions/answers!

user1330974
  • 2,500
  • 5
  • 32
  • 60

1 Answers1

2

You can find min/max date, create new pd.date_range() using MonthBegin/End date offsets and reindex:

df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()

dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

df = df.set_index("Date").reindex(dr).ffill().bfill().reset_index()
print(df)

Prints:

         Date    $
0  2021-01-01  1.0
1  2021-01-02  1.0
2  2021-01-03  1.0
3  2021-01-04  1.0
4  2021-01-05  2.0
5  2021-01-06  2.0

... 

55 2021-02-25  3.0
56 2021-02-26  3.0
57 2021-02-27  3.0
58 2021-02-28  3.0
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • 2
    Good solution without needing to hardcode dates! Upvoted. You can also add a parameter `name='Date'` to the call of `pd.date_range()` so that you don't need to rename the column after reset_index. – SeaBean May 17 '21 at 18:38
  • 2
    @SeaBean Thanks for suggestion. These *renamings* in pandas are hard to remember. – Andrej Kesely May 17 '21 at 18:41
  • 1
    @AndrejKesely Thank you! This is just what I needed! One quick follow-up question: if you check, you'll notice that the `$` value from `'2021-01-01'` is carried over into `$` value of `'2021-02-01', '2021-02-02', and '2021-02-03'`. Is there a way to forward-fill only up to the end of each month and backfill up to the beginning of each month? In other words, the `$` for `'2021-01-xx'` should be `1.0` or `2.0` and that from `'2021-02-xx'` should be `3.0`? I can submit a new question if you prefer to answer it in a different thread. Thank you very much again! – user1330974 May 17 '21 at 18:53
  • 1
    @user1330974 You can submit new question. But as a suggestion, you can use `.groupby()` with date year/month and transform the values – Andrej Kesely May 17 '21 at 19:10
  • @AndrejKesely Will do if I cannot figure it out after playing around with `.groupby()` like you suggested. :) Thank you very much again! – user1330974 May 17 '21 at 19:16