Pandas- adding missing dates to DataFrame while keeping column/index values?

Question

I have a pandas dataframe that incorporates dates, customers, items, and then dollar value for purchases.

   date     customer   product   amt  
 1/1/2017   tim        apple       3  
 1/1/2017   jim        melon       2  
 1/1/2017   tom        apple       5  
 1/1/2017   tom        melon       4  
 1/4/2017   tim        melon       3  
 1/4/2017   jim        apple       2  
 1/4/2017   tom        melon       1  
 1/4/2017   tom        orange      4

I'm trying to just look at performance, but I want to forward fill all dates from my min and max date range, and also fill for each customer for each product

Something like:

   date     customer   product   amt  
 1/1/2017   tim        apple       3  
 1/1/2017   tim        melon       0  
 1/1/2017   tim        orange      0  
 1/1/2017   jim        melon       2  
 1/1/2017   jim        apple       0  
 1/1/2017   jim        orange      0  
 1/1/2017   tom        apple       5  
 1/1/2017   tom        melon       4  
 1/1/2017   tom        orange      0  
 1/2/2017   tim        apple       0  
 1/2/2017   tim        melon       0  
 1/2/2017   tim        orange      0  
 1/2/2017   jim        melon       0  
 1/2/2017   jim        apple       0  
 1/2/2017   jim        orange      0  
 1/2/2017   tom        apple       0  
 1/2/2017   tom        melon       0  
 1/2/2017   tom        orange      0  
 1/3/2017   tim        apple       0  
 1/3/2017   tim        melon       0  
 1/3/2017   tim        orange      0  
 1/3/2017   jim        melon       0  
 1/3/2017   jim        apple       0  
 1/3/2017   jim        orange      0  
 1/3/2017   tom        apple       0  
 1/3/2017   tom        melon       0  
 1/3/2017   tom        orange      0  
 1/4/2017   tim        melon       3  
 1/4/2017   tim        apple       0  
 1/4/2017   tim        orange      0  
 1/4/2017   jim        apple       2  
 1/4/2017   jim        melon       0  
 1/4/2017   jim        orange      0  
 1/4/2017   tom        melon       1  
 1/4/2017   tom        orange      4  
 1/4/2017   tom        apple       0

I know that I can create a reindex based off of the max and min dates, but this also makes my customer and product values 0. Is there any other way to go about this? Am I missing a step or something? Appreciate the help

You want to use the `ffill` method as in [this](https://stackoverflow.com/a/33261465/4819376) answer — rpanai, Dec 18 '17 at 21:50
however this doesn't fill with dates? and doesn't keep the columns customer and product — Dick Thompson, Dec 18 '17 at 21:58

BENY · Accepted Answer · 2017-12-18T22:34:30.737

2

Notice ,this using the stack and unstack couple of times

df.set_index(['date','customer','product']).amt.unstack(-3).\
  reindex(columns=pd.date_range(df['date'].min(), 
    df['date'].max()),fill_value=0).\
      stack(dropna=False).unstack().stack(dropna=False).\
        unstack('customer').stack(dropna=False).reset_index().\
          fillna(0).sort_values(['level_1','customer','product'])
Out[314]: 
   product    level_1 customer    0
0    apple 2017-01-01      jim  0.0
12   melon 2017-01-01      jim  2.0
24  orange 2017-01-01      jim  0.0
1    apple 2017-01-01      tim  3.0
13   melon 2017-01-01      tim  0.0
25  orange 2017-01-01      tim  0.0
2    apple 2017-01-01      tom  5.0
14   melon 2017-01-01      tom  4.0
26  orange 2017-01-01      tom  0.0
3    apple 2017-01-02      jim  0.0
15   melon 2017-01-02      jim  0.0
27  orange 2017-01-02      jim  0.0
4    apple 2017-01-02      tim  0.0
16   melon 2017-01-02      tim  0.0
28  orange 2017-01-02      tim  0.0
5    apple 2017-01-02      tom  0.0
17   melon 2017-01-02      tom  0.0
29  orange 2017-01-02      tom  0.0
6    apple 2017-01-03      jim  0.0
18   melon 2017-01-03      jim  0.0
30  orange 2017-01-03      jim  0.0
7    apple 2017-01-03      tim  0.0
19   melon 2017-01-03      tim  0.0
31  orange 2017-01-03      tim  0.0
8    apple 2017-01-03      tom  0.0
20   melon 2017-01-03      tom  0.0
32  orange 2017-01-03      tom  0.0
9    apple 2017-01-04      jim  2.0
21   melon 2017-01-04      jim  0.0
33  orange 2017-01-04      jim  0.0
10   apple 2017-01-04      tim  0.0
22   melon 2017-01-04      tim  3.0
34  orange 2017-01-04      tim  0.0
11   apple 2017-01-04      tom  0.0
23   melon 2017-01-04      tom  1.0
35  orange 2017-01-04      tom  4.0

edited Dec 18 '17 at 22:34

answered Dec 18 '17 at 22:03

BENY

317,841
20
164
234

@DickThompson ran it what kind of error code you received ? – BENY Dec 18 '17 at 22:32
@DickThompson sorry typo, change | to \ , you will get teh result – BENY Dec 18 '17 at 22:35
ah ok- and what would I do if I had another column classifier? like payment type? Just add to the indexes? – Dick Thompson Dec 18 '17 at 22:53
@DickThompson yes and you may need another round of stack and unstack – BENY Dec 18 '17 at 22:54
this is what I did: `df.set_index(['date','customer','product','type']).amt.unstack(-4).\ reindex(columns=pd.date_range(df['date'].min(), df['date'].max()),fill_value=0).\ stack(dropna=False).unstack().stack(dropna=False).\ unstack('customer').stack(dropna=False).reset_index().\ unstack('product').stack(dropna=False).reset_index().\ fillna(0).sort_values(['level_1','customer','product','type'])` but I got a stacking error. Is there anything else I need to add? – Dick Thompson Dec 18 '17 at 22:56
@DickThompson it is hard , for me to say without seeing the actual data , but since you know how stack and unstack work – BENY Dec 18 '17 at 23:01
it would literally be the same data but with another column that has 2 values- 'card' or 'cash'. I can't seem to figure it out, might be a minor formatting issue – Dick Thompson Dec 18 '17 at 23:03
@DickThompson change your level_1 to level_2 :-) – BENY Dec 18 '17 at 23:14
still the same issue- should I edit my question? just because I have 2 of these datasets – Dick Thompson Dec 18 '17 at 23:18
@DickThompson `df.set_index(['date','customer','product','type']).amt.unstack(-4).\ reindex(columns=pd.date_range(df['date'].min(), df['date'].max()),fill_value=0).\ stack(dropna=False).unstack().stack(dropna=False).\ unstack('customer').stack(dropna=False).reset_index().\ fillna(0)` after this you can sort the values – BENY Dec 18 '17 at 23:20
Sorry, still can't get to work? Telling me module 'pandas' has no attribute 'date_range' – Dick Thompson Dec 18 '17 at 23:30

MaxU - stand with Ukraine · Answer 2 · 2017-12-18T22:01:25.950

IIUC you can do it this way:

In [63]: dates = pd.date_range(df['date'].min(), df['date'].max())

In [64]: idx = pd.MultiIndex.from_product((dates,
                                           df['customer'].unique(), 
                                           df['product'].unique()))

In [72]: (df.set_index(['date','customer','product'])
            .reindex(idx, fill_value=0)
            .reset_index()
            .set_axis(df.columns, axis=1, inplace=False))
Out[72]:
         date customer product  amt
0  2017-01-01      tim   apple    3
1  2017-01-01      tim   melon    0
2  2017-01-01      tim  orange    0
3  2017-01-01      jim   apple    0
4  2017-01-01      jim   melon    2
5  2017-01-01      jim  orange    0
6  2017-01-01      tom   apple    5
7  2017-01-01      tom   melon    4
8  2017-01-01      tom  orange    0
9  2017-01-02      tim   apple    0
..        ...      ...     ...  ...
26 2017-01-03      tom  orange    0
27 2017-01-04      tim   apple    0
28 2017-01-04      tim   melon    3
29 2017-01-04      tim  orange    0
30 2017-01-04      jim   apple    2
31 2017-01-04      jim   melon    0
32 2017-01-04      jim  orange    0
33 2017-01-04      tom   apple    0
34 2017-01-04      tom   melon    1
35 2017-01-04      tom  orange    4

[36 rows x 4 columns]

score 1 · Answer 3 · answered Dec 18 '17 at 22:29

Maybe because of my SQL mindset, consider a left join merge on an expanded helper dataframe:

helper_df_list = [pd.DataFrame({'date': pd.date_range(df['date'].min(), df['date'].max()), 
                                'customer': c, 'product': p }) 
                    for c in df['customer'].unique() 
                            for p in df['product'].unique()]

helper_df = pd.concat(helper_df_list, ignore_index=True)

final_df = pd.merge(helper_df, df, on=['date', 'customer', 'product'], how='left')\
                    .fillna(0).sort_values(['date', 'customer']).reset_index(drop=True)

Output

print(final_df)
#    customer       date product  amt
# 0       jim 2017-01-01   apple  0.0
# 1       jim 2017-01-01   melon  2.0
# 2       jim 2017-01-01  orange  0.0
# 3       tim 2017-01-01   apple  3.0
# 4       tim 2017-01-01   melon  0.0
# 5       tim 2017-01-01  orange  0.0
# 6       tom 2017-01-01   apple  5.0
# 7       tom 2017-01-01   melon  4.0
# 8       tom 2017-01-01  orange  0.0
# 9       jim 2017-01-02   apple  0.0
# 10      jim 2017-01-02   melon  0.0
# 11      jim 2017-01-02  orange  0.0
# 12      tim 2017-01-02   apple  0.0
# 13      tim 2017-01-02   melon  0.0
# 14      tim 2017-01-02  orange  0.0
# 15      tom 2017-01-02   apple  0.0
# 16      tom 2017-01-02   melon  0.0
# 17      tom 2017-01-02  orange  0.0
# 18      jim 2017-01-03   apple  0.0
# 19      jim 2017-01-03   melon  0.0
# 20      jim 2017-01-03  orange  0.0
# 21      tim 2017-01-03   apple  0.0
# 22      tim 2017-01-03   melon  0.0
# 23      tim 2017-01-03  orange  0.0
# 24      tom 2017-01-03   apple  0.0
# 25      tom 2017-01-03   melon  0.0
# 26      tom 2017-01-03  orange  0.0
# 27      jim 2017-01-04   apple  2.0
# 28      jim 2017-01-04   melon  0.0
# 29      jim 2017-01-04  orange  0.0
# 30      tim 2017-01-04   apple  0.0
# 31      tim 2017-01-04   melon  3.0
# 32      tim 2017-01-04  orange  0.0
# 33      tom 2017-01-04   apple  0.0
# 34      tom 2017-01-04   melon  1.0
# 35      tom 2017-01-04  orange  4.0

Scott Boston · Answer 4 · 2017-12-18T22:51:25.757

Let's use product from itertools, pd.date_range, and merge:

from itertools import product

daterange = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
d1 = pd.DataFrame(list(product(daterange, 
                               df['customer'].unique(),
                               df['product'].unique())), 
                  columns=['date', 'customer', 'product'])
d1.merge(df, on=['date', 'customer', 'product'], how='left').fillna(0)

Output:

         date customer product  amt
0  2017-01-01      tim   apple  3.0
1  2017-01-01      tim   melon  0.0
2  2017-01-01      tim  orange  0.0
3  2017-01-01      jim   apple  0.0
4  2017-01-01      jim   melon  2.0
5  2017-01-01      jim  orange  0.0
6  2017-01-01      tom   apple  5.0
7  2017-01-01      tom   melon  4.0
8  2017-01-01      tom  orange  0.0
9  2017-01-02      tim   apple  0.0
10 2017-01-02      tim   melon  0.0
11 2017-01-02      tim  orange  0.0
12 2017-01-02      jim   apple  0.0
13 2017-01-02      jim   melon  0.0
14 2017-01-02      jim  orange  0.0
15 2017-01-02      tom   apple  0.0
16 2017-01-02      tom   melon  0.0
17 2017-01-02      tom  orange  0.0
18 2017-01-03      tim   apple  0.0
19 2017-01-03      tim   melon  0.0
20 2017-01-03      tim  orange  0.0
21 2017-01-03      jim   apple  0.0
22 2017-01-03      jim   melon  0.0
23 2017-01-03      jim  orange  0.0
24 2017-01-03      tom   apple  0.0
25 2017-01-03      tom   melon  0.0
26 2017-01-03      tom  orange  0.0
27 2017-01-04      tim   apple  0.0
28 2017-01-04      tim   melon  3.0
29 2017-01-04      tim  orange  0.0
30 2017-01-04      jim   apple  2.0
31 2017-01-04      jim   melon  0.0
32 2017-01-04      jim  orange  0.0
33 2017-01-04      tom   apple  0.0
34 2017-01-04      tom   melon  1.0
35 2017-01-04      tom  orange  4.0

Pandas- adding missing dates to DataFrame while keeping column/index values?

4 Answers4

Linked