0

Suppose I have a dataframe which has Date and $ columns like below:

>>> import pandas as pd
>>> data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
>>> df = pd.DataFrame(data, columns=['Date','$'])

Since the original data has date gaps, I filled the date gaps using a recommendation answered in this StackOverflow post like below:

>>> df.Date = pd.to_datetime(df.Date)
>>> mn = df.Date.min()
>>> mx = df.Date.max()
>>>
>>> dr = pd.date_range(
...     mn - pd.tseries.offsets.MonthBegin(),
...     mx + pd.tseries.offsets.MonthEnd(),
...     name="Date",
... )
>>>
>>> df = df.set_index("Date").reindex(dr).ffill().bfill().reset_index()
>>> print(df)
         Date    $
0  2021-01-01  1.0
1  2021-01-02  1.0
2  2021-01-03  1.0
3  2021-01-04  1.0
4  2021-01-05  2.0
5  2021-01-06  2.0
6  2021-01-07  2.0
7  2021-01-08  2.0
8  2021-01-09  2.0
9  2021-01-10  2.0
10 2021-01-11  2.0
11 2021-01-12  2.0
12 2021-01-13  2.0
13 2021-01-14  2.0
14 2021-01-15  2.0
15 2021-01-16  2.0
16 2021-01-17  2.0
17 2021-01-18  2.0
18 2021-01-19  2.0
19 2021-01-20  2.0
20 2021-01-21  2.0
21 2021-01-22  2.0
22 2021-01-23  2.0
23 2021-01-24  2.0
24 2021-01-25  2.0
25 2021-01-26  2.0
26 2021-01-27  2.0
27 2021-01-28  2.0
28 2021-01-29  2.0
29 2021-01-30  2.0
30 2021-01-31  2.0
31 2021-02-01  2.0 # <== here, the $ value should be 3.0 and onward
32 2021-02-02  2.0
33 2021-02-03  2.0
34 2021-02-04  2.0
35 2021-02-05  3.0
36 2021-02-06  3.0
37 2021-02-07  3.0
38 2021-02-08  3.0
39 2021-02-09  3.0
40 2021-02-10  3.0
41 2021-02-11  3.0
42 2021-02-12  3.0
43 2021-02-13  3.0
44 2021-02-14  3.0
45 2021-02-15  3.0
46 2021-02-16  3.0
47 2021-02-17  3.0
48 2021-02-18  3.0
49 2021-02-19  3.0
50 2021-02-20  3.0
51 2021-02-21  3.0
52 2021-02-22  3.0
53 2021-02-23  3.0
54 2021-02-24  3.0
55 2021-02-25  3.0
56 2021-02-26  3.0
57 2021-02-27  3.0
58 2021-02-28  3.0

Using that approach, the forward fill, ffill, without checking the month's boundary copies the $ values a bit too far as you can see above. I looked around StackOverflow and found that (e.g., this post) there's a way to use groupby() to know the boundary of each month. That leads me to this point below:

>>> start_date_of_each_month = (df.set_index('Date').index.to_series().groupby(pd.Grouper(freq='M')).min())
>>> start_date_of_each_month
Date
2021-01-31   2021-01-01
2021-02-28   2021-02-01
Freq: M, Name: Date, dtype: datetime64[ns]

Q: How can I utilize this to correct the $ above so that each month's values are contained within that particular month? In particular, how do I transform the df to look like this?

         Date    $
0  2021-01-01  1.0
1  2021-01-02  1.0
2  2021-01-03  1.0
3  2021-01-04  1.0
4  2021-01-05  2.0
5  2021-01-06  2.0
6  2021-01-07  2.0
7  2021-01-08  2.0
8  2021-01-09  2.0
9  2021-01-10  2.0
10 2021-01-11  2.0
11 2021-01-12  2.0
12 2021-01-13  2.0
13 2021-01-14  2.0
14 2021-01-15  2.0
15 2021-01-16  2.0
16 2021-01-17  2.0
17 2021-01-18  2.0
18 2021-01-19  2.0
19 2021-01-20  2.0
20 2021-01-21  2.0
21 2021-01-22  2.0
22 2021-01-23  2.0
23 2021-01-24  2.0
24 2021-01-25  2.0
25 2021-01-26  2.0
26 2021-01-27  2.0
27 2021-01-28  2.0
28 2021-01-29  2.0
29 2021-01-30  2.0
30 2021-01-31  2.0
31 2021-02-01  3.0 # <== here, the $ value should be 3.0
32 2021-02-02  3.0
33 2021-02-03  3.0
34 2021-02-04  3.0
35 2021-02-05  3.0
36 2021-02-06  3.0
37 2021-02-07  3.0
38 2021-02-08  3.0
39 2021-02-09  3.0
40 2021-02-10  3.0
41 2021-02-11  3.0
42 2021-02-12  3.0
43 2021-02-13  3.0
44 2021-02-14  3.0
45 2021-02-15  3.0
46 2021-02-16  3.0
47 2021-02-17  3.0
48 2021-02-18  3.0
49 2021-02-19  3.0
50 2021-02-20  3.0
51 2021-02-21  3.0
52 2021-02-22  3.0
53 2021-02-23  3.0
54 2021-02-24  3.0
55 2021-02-25  3.0
56 2021-02-26  3.0
57 2021-02-27  3.0
58 2021-02-28  3.0

Thanks in advance for your answers/suggestions!

user1330974
  • 2,500
  • 5
  • 32
  • 60

1 Answers1

1

Some options:

  1. pd.Grouper
import pandas as pd

data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])

df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()

dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

df = df.set_index("Date").reindex(dr).reset_index()

df['$'] = df.groupby(pd.Grouper(key='Date', freq='1M'))['$'].ffill().bfill()

print(df)
  1. dt.strftime %m:
import pandas as pd

data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])

df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()

dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

df = df.set_index("Date").reindex(dr).reset_index()

df['$'] = df.groupby(df['Date'].dt.strftime('%m'))['$'].ffill().bfill()

print(df)
  1. Resample
import pandas as pd

data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])

df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()

dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

df = df.set_index("Date").reindex(dr)

df['$'] = df.resample('M')['$'].ffill().bfill()

df = df.reset_index()

print(df)

All produce:

         Date    $
0  2021-01-01  1.0
1  2021-01-02  1.0
2  2021-01-03  1.0
3  2021-01-04  1.0
4  2021-01-05  2.0
5  2021-01-06  2.0
6  2021-01-07  2.0
7  2021-01-08  2.0
      ...     
30 2021-01-31  2.0
31 2021-02-01  3.0
32 2021-02-02  3.0
33 2021-02-03  3.0
34 2021-02-04  3.0
35 2021-02-05  3.0
36 2021-02-06  3.0
37 2021-02-07  3.0
38 2021-02-08  3.0
      ...
57 2021-02-27  3.0
58 2021-02-28  3.0

Depending on how you want to handle months with no values you can apply ffill and bfill instead:

import pandas as pd

data = [['2020-12-02', 1.0], ['2021-02-05', 2.0], ['2021-03-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])

df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()

dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

df = df.set_index("Date").reindex(dr)

df['$'] = df.resample('M')['$'].apply(lambda s: s.ffill().bfill())

df = df.reset_index()

print(df.to_string())

January has no values so it stays NaN as opposed to previous methods which would have filled January with the last known value 1.0.

          Date    $
0   2020-12-01  1.0
       ...
30  2020-12-31  1.0
31  2021-01-01  NaN
       ...
61  2021-01-31  NaN
62  2021-02-01  2.0
       ...
89  2021-02-28  2.0
90  2021-03-01  3.0
91  2021-03-02  3.0
       ...
120 2021-03-31  3.0
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • Thank you so much for providing various ways to handle this issue! From your post, I've learned quite a bit about how `Groupby()` works in `Pandas`. :) Follow-up question: if I have other columns (e.g., `Spend` and `Running Total`) in the `df`, do I need to call `df['$'] = df.groupby(pd.Grouper(key='Date', freq='1M'))['$'].ffill().bfill()`; `df['Spend'] = df.groupby(pd.Grouper(key='Date', freq='1M'))['Spend'].ffill().bfill()` and `df['Running Total'] = df.groupby(pd.Grouper(key='Date', freq='1M'))['Running Total'].ffill().bfill()` or is there a way to do it all in one line? – user1330974 May 17 '21 at 20:33