Suppose I have a dataframe which has Date
and $
columns like below:
>>> import pandas as pd
>>> data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
>>> df = pd.DataFrame(data, columns=['Date','$'])
Since the original data has date gaps, I filled the date gaps using a recommendation answered in this StackOverflow post like below:
>>> df.Date = pd.to_datetime(df.Date)
>>> mn = df.Date.min()
>>> mx = df.Date.max()
>>>
>>> dr = pd.date_range(
... mn - pd.tseries.offsets.MonthBegin(),
... mx + pd.tseries.offsets.MonthEnd(),
... name="Date",
... )
>>>
>>> df = df.set_index("Date").reindex(dr).ffill().bfill().reset_index()
>>> print(df)
Date $
0 2021-01-01 1.0
1 2021-01-02 1.0
2 2021-01-03 1.0
3 2021-01-04 1.0
4 2021-01-05 2.0
5 2021-01-06 2.0
6 2021-01-07 2.0
7 2021-01-08 2.0
8 2021-01-09 2.0
9 2021-01-10 2.0
10 2021-01-11 2.0
11 2021-01-12 2.0
12 2021-01-13 2.0
13 2021-01-14 2.0
14 2021-01-15 2.0
15 2021-01-16 2.0
16 2021-01-17 2.0
17 2021-01-18 2.0
18 2021-01-19 2.0
19 2021-01-20 2.0
20 2021-01-21 2.0
21 2021-01-22 2.0
22 2021-01-23 2.0
23 2021-01-24 2.0
24 2021-01-25 2.0
25 2021-01-26 2.0
26 2021-01-27 2.0
27 2021-01-28 2.0
28 2021-01-29 2.0
29 2021-01-30 2.0
30 2021-01-31 2.0
31 2021-02-01 2.0 # <== here, the $ value should be 3.0 and onward
32 2021-02-02 2.0
33 2021-02-03 2.0
34 2021-02-04 2.0
35 2021-02-05 3.0
36 2021-02-06 3.0
37 2021-02-07 3.0
38 2021-02-08 3.0
39 2021-02-09 3.0
40 2021-02-10 3.0
41 2021-02-11 3.0
42 2021-02-12 3.0
43 2021-02-13 3.0
44 2021-02-14 3.0
45 2021-02-15 3.0
46 2021-02-16 3.0
47 2021-02-17 3.0
48 2021-02-18 3.0
49 2021-02-19 3.0
50 2021-02-20 3.0
51 2021-02-21 3.0
52 2021-02-22 3.0
53 2021-02-23 3.0
54 2021-02-24 3.0
55 2021-02-25 3.0
56 2021-02-26 3.0
57 2021-02-27 3.0
58 2021-02-28 3.0
Using that approach, the forward fill, ffill
, without checking the month's boundary copies the $
values a bit too far as you can see above. I looked around StackOverflow and found that (e.g., this post) there's a way to use groupby()
to know the boundary of each month. That leads me to this point below:
>>> start_date_of_each_month = (df.set_index('Date').index.to_series().groupby(pd.Grouper(freq='M')).min())
>>> start_date_of_each_month
Date
2021-01-31 2021-01-01
2021-02-28 2021-02-01
Freq: M, Name: Date, dtype: datetime64[ns]
Q: How can I utilize this to correct the $
above so that each month's values are contained within that particular month? In particular, how do I transform the df
to look like this?
Date $
0 2021-01-01 1.0
1 2021-01-02 1.0
2 2021-01-03 1.0
3 2021-01-04 1.0
4 2021-01-05 2.0
5 2021-01-06 2.0
6 2021-01-07 2.0
7 2021-01-08 2.0
8 2021-01-09 2.0
9 2021-01-10 2.0
10 2021-01-11 2.0
11 2021-01-12 2.0
12 2021-01-13 2.0
13 2021-01-14 2.0
14 2021-01-15 2.0
15 2021-01-16 2.0
16 2021-01-17 2.0
17 2021-01-18 2.0
18 2021-01-19 2.0
19 2021-01-20 2.0
20 2021-01-21 2.0
21 2021-01-22 2.0
22 2021-01-23 2.0
23 2021-01-24 2.0
24 2021-01-25 2.0
25 2021-01-26 2.0
26 2021-01-27 2.0
27 2021-01-28 2.0
28 2021-01-29 2.0
29 2021-01-30 2.0
30 2021-01-31 2.0
31 2021-02-01 3.0 # <== here, the $ value should be 3.0
32 2021-02-02 3.0
33 2021-02-03 3.0
34 2021-02-04 3.0
35 2021-02-05 3.0
36 2021-02-06 3.0
37 2021-02-07 3.0
38 2021-02-08 3.0
39 2021-02-09 3.0
40 2021-02-10 3.0
41 2021-02-11 3.0
42 2021-02-12 3.0
43 2021-02-13 3.0
44 2021-02-14 3.0
45 2021-02-15 3.0
46 2021-02-16 3.0
47 2021-02-17 3.0
48 2021-02-18 3.0
49 2021-02-19 3.0
50 2021-02-20 3.0
51 2021-02-21 3.0
52 2021-02-22 3.0
53 2021-02-23 3.0
54 2021-02-24 3.0
55 2021-02-25 3.0
56 2021-02-26 3.0
57 2021-02-27 3.0
58 2021-02-28 3.0
Thanks in advance for your answers/suggestions!