So, I have a dataframe like this one:
date ID value
2018-01-01 A 10
2018-02-01 A 11
2018-04-01 A 13
2017-08-01 B 20
2017-10-01 B 21
2017-11-01 B 23
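For reproducibility, here's a minimal snippet that builds this sample (assuming date is an actual datetime column, not a string):

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01', '2018-02-01', '2018-04-01',
                            '2017-08-01', '2017-10-01', '2017-11-01']),
    'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, 11, 13, 20, 21, 23],
})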
Each group can have very different dates, and there are about 400k groups. What I want is to fill in the missing dates for each group efficiently, so it looks like this:
date ID value
2018-01-01 A 10
2018-02-01 A 11
2018-03-01 A nan
2018-04-01 A 13
2017-08-01 B 20
2017-09-01 B nan
2017-10-01 B 21
2017-11-01 B 23
I've tried two approaches:
# 'MS' (month start) matches the monthly dates; 'D' would insert a row per day
df2 = df.groupby('ID').apply(lambda x: x.set_index('date').resample('MS').pad())
And also:
# Build the full (date, ID) cross product, keeping the nan rows
df2 = df.set_index(['date', 'ID']).unstack().stack(dropna=False).reset_index()
df2 = df2.sort_values(by=['ID', 'date']).reset_index(drop=True)
# Drop rows before each group's first observation and after its last
df2 = df2[df2.groupby('ID').value.ffill().notna()]
df2 = df2[df2.groupby('ID').value.bfill().notna()]
The first one is very slow because it uses apply. I guess I could use something other than pad so I get nan instead of the previous value, but I'm not sure that alone would improve performance enough; I waited around 15 minutes and it still hadn't finished running.
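For reference, a sketch of that variant (still apply-based, so I'd expect it to be similarly slow): asfreq leaves the inserted rows as nan instead of forward-filling them:

df2 = (df.groupby('ID')
         .apply(lambda x: x.set_index('date').resample('MS').asfreq())
         .drop(columns='ID')  # ID is now in the index; drop the stale column
         .reset_index())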
The second one fills every date that appears anywhere in the dataframe into every group, which produces a massive intermediate dataframe; afterwards I drop all the leading and trailing nan rows this method generates.
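For a sense of scale: stack(dropna=False) builds the full cross product of observed dates and IDs, so the size of the intermediate can be estimated up front:

# rows in the unstack/stack intermediate: every observed date x every ID
n_intermediate = df['date'].nunique() * df['ID'].nunique()
print(n_intermediate, 'rows, vs', len(df), 'actual observations')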
This is quite a bit faster than the first option, but it still doesn't seem like the best approach. Is there a better way to do this, one that scales to big dataframes?