So, I have a dataframe like this one:

   date       ID   value
2018-01-01    A     10
2018-02-01    A     11
2018-04-01    A     13
2017-08-01    B     20
2017-10-01    B     21
2017-11-01    B     23
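
For reference, here's a minimal version of that dataframe (the real one has ~400k groups):

import pandas as pd

# minimal reproducible version of the example above
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01', '2018-02-01', '2018-04-01',
                            '2017-08-01', '2017-10-01', '2017-11-01']),
    'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, 11, 13, 20, 21, 23],
})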

Each group can have very different dates, and there are about 400k groups. What I want is to fill in the missing monthly dates within each group's own date range, efficiently, so the result looks like this:

   date       ID   value
2018-01-01    A     10
2018-02-01    A     11
2018-03-01    A     nan
2018-04-01    A     13
2017-08-01    B     20
2017-09-01    B     nan
2017-10-01    B     21
2017-11-01    B     23

I've tried two approaches:

# resample each group to month starts ('MS'); pad forward-fills, which isn't quite what I want
df2 = df.groupby('ID').apply(lambda x: x.set_index('date').resample('MS').pad())

And also:

# build the full date x ID grid, then drop each group's leading/trailing nan
df2 = df.set_index(['date', 'ID']).unstack().stack(dropna=False).reset_index()
df2 = df2.sort_values(by=['ID', 'date']).reset_index(drop=True)
df2 = df2[df2.groupby('ID').value.ffill().notna()]  # drop leading nan per group
df2 = df2[df2.groupby('ID').value.bfill().notna()]  # drop trailing nan per group

The first one is very slow because it uses apply. I guess I could use something else instead of pad so I get nan instead of the previous value, but I doubt that alone would help performance much; I waited around 15 minutes and it still hadn't finished.
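
For instance, I think swapping pad for asfreq would leave the new rows as nan, but it still goes through apply for every group:

# same approach, but asfreq leaves the inserted rows as nan instead of padding them
df2 = df.groupby('ID').apply(lambda x: x.set_index('date').resample('MS').asfreq())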

The second one fills from the first date in the whole dataframe to the last one, for every group, which produces a massive intermediate dataframe; afterward I drop all the leading and trailing nan this generates. It's considerably faster than the first option, but it still doesn't seem ideal. Is there a better way to do this that scales to big dataframes?
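
Just to quantify the problem with the second approach: the intermediate frame has one row for every (date, ID) combination, e.g.:

# rough size of the unstack/stack intermediate: every group gets every distinct date
n_rows = df['ID'].nunique() * df['date'].nunique()  # ~400k groups x all distinct dates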

