1
   df = pd.DataFrame({"ID":[1,1,1,2,2,2],
                   "Date":["29.12.2020","05.01.2021","15.02.2021","11.04.2021","27.05.2021","29.05.2021"],
                   "Amount":[6,5,7,9,8,7]}) 
   df["Date"] = pd.to_datetime(df["Date"])

Data:

ID Date Amount
1 29.12.2020 6
1 05.01.2021 5
1 15.02.2021 7
2 11.04.2021 9
2 27.05.2021 8
2 29.05.2021 7

Desired output:

ID Date Amount
1 01.12.2020 0
. . .
. . .
1 28.12.2020 0
1 29.12.2020 6
. . .
. . .
1 04.01.2020 6
1 05.01.2021 5
. . .
. . .
1 14.02.2021 5
1 15.02.2021 7
. . .
. . .
1 28.02.2021 7
2 01.04.2021 0
. . .
. . .
2 10.04.2021 0
2 11.04.2021 9
. . .
. . .
2 26.05.2021 9
2 27.05.2021 8
2 28.05.2021 8
2 29.05.2021 7

So, basically, I want to expand the data set for the missing dates and insert the latest of preceding amount value for every newly added date.

The date ranges of the ID groups do vary.

Plus, fill value should be 0 for the unobserved values in the month associated with the first observation (e.g. Amount column takes the value of 0 between 01.12.2020 and 28.12.2020, since the first observation of twelfth month of 2020 is on 29th day).

Larx
  • 101
  • 7

1 Answers1

2

I would set the df index to Date, then group by ID and finally reindex depending on the oldest (replacing it with the first day of the month) and most recent dates:

import pandas as pd

df = pd.DataFrame({"ID":[1,1,1,2,2,2],
                   "Date":["29.12.2020","05.01.2021","15.02.2021","11.04.2021","27.05.2021","29.05.2021"],
                   "Amount":[6,5,7,9,8,7]}) 
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
df = df.set_index("Date")

new_df = pd.DataFrame()
for id_val, obs_period in df.groupby("ID"):
    date_range = pd.date_range(min(obs_period.index).replace(day=1), max(obs_period.index))
    obs_period = obs_period.reindex(date_range, fill_value=pd.NA)
    obs_period["ID"] = id_val
    if pd.isna(obs_period.at[obs_period.index[0], "Amount"]):
        obs_period.at[obs_period.index[0], "Amount"] = 0  # adding 0 at the beginning of the period if undefined
    obs_period= obs_period.ffill()      # filling Amount with last value
    new_df = pd.concat([new_df, obs_period])

print(new_df)

BTW you should specify your date format while converting df["Date"]

Output:

            ID  Amount
2020-12-01   1     0.0
2020-12-02   1     0.0
2020-12-03   1     0.0
2020-12-04   1     0.0
2020-12-05   1     0.0
...         ..     ...
2021-05-25   2     9.0
2021-05-26   2     9.0
2021-05-27   2     8.0
2021-05-28   2     8.0
2021-05-29   2     7.0

[136 rows x 2 columns]
Tranbi
  • 11,407
  • 6
  • 16
  • 33
  • Noted. Thank you. BTW the issue is not expanding the dataframe for missing dates and filling them with zeros. They should be filled with the latest non-null if it exists or 0 if it does not. – Larx Nov 11 '21 at 08:20
  • 1
    Understood. In that case you can use `ffill`. I updated my answer! – Tranbi Nov 11 '21 at 09:01
  • 1
    It does not extent to unobserved future dates (such as 16-28 Feb of first ID group), however I decided that this is better for my actualy purpose and use your code as is. Thank you. – Larx Nov 11 '21 at 09:36
  • 1
    I'm glad it helped! Finding the last day of the month is not as trivial as finding the first one but it's only a few lines. If you change your mind check this SO [answer](https://stackoverflow.com/a/13565185/13525512) – Tranbi Nov 11 '21 at 09:43