0

I have a list of dates and a dataframe. Now the dataframe has an id column and other values that are not consistent for all dates. I want to fill zeros in all columns for the ids and dates where there is no data. Let me show you by example:

    date     id     clicks    conv    rev
  2019-01-21 234      34        1     10
  2019-01-21 235      32        0     0
  2019-01-24 234      56        2     20
  2019-01-23 235      23        3     30

date list is like this:

    [2019-01-01, 2019-01-02,2019-01-03 ....2019-02-28]

What I want is to add zeros for all the missing dates in the dataframe for all ids. So the resultant df should look like:

    date     id     clicks    conv    rev
  2019-01-01 234      0         0     0
  2019-01-01 235      0         0     0
     .                .         .     .
     .                .         .     .

  2019-01-21 234      34        1     10
  2019-01-21 235      32        0     0
  2019-01-22 234      0         0     0
  2019-01-22 235      0         0     0
  2019-01-23 234      0         0     0
  2019-01-23 235      0         0     0
  2019-01-24 234      56        2     20
  2019-01-23 235      23        3     30
      .               .         .      .
  2019-02-28  0       0         0      0
N91
  • 395
  • 1
  • 3
  • 14

1 Answers1

2

With set_index + reindex from the cartesian product of values. Here I'll create the dates with pd.date_range to save some typing, and ensure dates are datetime

import pandas as pd

df['date'] = pd.to_datetime(df.date)
my_dates = pd.date_range('2019-01-01', '2019-02-28', freq='D')

idx = pd.MultiIndex.from_product([my_dates, df.id.unique()], names=['date', 'id'])
df = df.set_index(['date', 'id']).reindex(idx).fillna(0).reset_index()

Output: df

          date     id  clicks  conv   rev
0   2019-01-01    234     0.0   0.0   0.0
1   2019-01-01    235     0.0   0.0   0.0
...
45  2019-01-23    235    23.0   3.0  30.0
46  2019-01-24    234    56.0   2.0  20.0
47  2019-01-24    235     0.0   0.0   0.0
...
115 2019-02-27    235     0.0   0.0   0.0
116 2019-02-28    234     0.0   0.0   0.0
117 2019-02-28    235     0.0   0.0   0.0
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • Exception: cannot handle a non-unique multi-index! I get this error – N91 Apr 27 '19 at 17:41
  • 1
    @Nofy then your original DataFrame has multiple row where the same ID has the same date listed more than once. Is that expected? If it is, how do you want to deal with these duplicated rows? – ALollz Apr 27 '19 at 17:44
  • Yes, that's correct. I handled that and it is now working. I just get this warning: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead – N91 Apr 27 '19 at 17:52
  • 1
    @Nofy yes, that is likely unrelated to this problem. If you did a `.drop_duplicates()` add `.copy()` at the end, since dropping duplicates returns a view. Otherwise you can do `df = df.copy()` prior to this operation to remove the warning. This post is very useful with regard to that warning: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas/53954986#53954986 – ALollz Apr 27 '19 at 17:58