
I have a DataFrame that looks like the following:

x = pd.DataFrame({'user': ['a','a','b','b'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06'], 'val': [1,33,2,1]})

What I would like to do is find the minimum and maximum date in the date column, expand that column so every date in between is present, and fill in 0 for the val column on the new rows. So the desired output is:

            dt user  val
0   2016-01-01    a    1
1   2016-01-02    a   33
2   2016-01-03    a    0
3   2016-01-04    a    0
4   2016-01-05    a    0
5   2016-01-06    a    0
6   2016-01-01    b    0
7   2016-01-02    b    0
8   2016-01-03    b    0
9   2016-01-04    b    0
10  2016-01-05    b    2
11  2016-01-06    b    1

I've tried the solutions mentioned here and here, but they aren't what I'm after. Any pointers much appreciated.

broccoli

3 Answers


Initial DataFrame:

            dt  user    val
0   2016-01-01     a      1
1   2016-01-02     a     33
2   2016-01-05     b      2
3   2016-01-06     b      1

First, convert the dates to datetime:

x['dt'] = pd.to_datetime(x['dt'])

Then, generate the dates and unique users:

dates = x.set_index('dt').resample('D').asfreq().index

>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', name='dt', freq='D')

users = x['user'].unique()

>> array(['a', 'b'], dtype=object)

This will allow you to create a MultiIndex:

idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])

>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['dt', 'user'])

You can use that to reindex your DataFrame:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out: 
           dt user  val
0  2016-01-01    a    1
1  2016-01-01    b    0
2  2016-01-02    a   33
3  2016-01-02    b    0
4  2016-01-03    a    0
5  2016-01-03    b    0
6  2016-01-04    a    0
7  2016-01-04    b    0
8  2016-01-05    a    0
9  2016-01-05    b    2
10 2016-01-06    a    0
11 2016-01-06    b    1

which then can be sorted by users:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out: 
           dt user  val
0  2016-01-01    a    1
2  2016-01-02    a   33
4  2016-01-03    a    0
6  2016-01-04    a    0
8  2016-01-05    a    0
10 2016-01-06    a    0
1  2016-01-01    b    0
3  2016-01-02    b    0
5  2016-01-03    b    0
7  2016-01-04    b    0
9  2016-01-05    b    2
11 2016-01-06    b    1
C. Braun
ayhan
    In this case, all users share the same time range. What if each user has a different date range [min, max]? – luoshao23 Oct 26 '18 at 02:38
  • @luoshao23 `MultiIndex.from_product` is a convenience function for the cartesian product, but the index can be constructed in other ways too. What kind of data do you have for this? How do you store the min and max for each user, for example? Does it come from a DataFrame? – ayhan Oct 26 '18 at 17:34
  • @ayhan Yes, I make a `date_range` using the min/max values, and then use `MultiIndex.from_product` to generate a multi-level index for each user. Finally, I `concat` all the DataFrames into one DataFrame. It works, but it takes a long time. Do you have any idea how to improve the efficiency? – luoshao23 Oct 27 '18 at 14:15
  • @luoshao23 If all of them are coming from a single DataFrame, generating the MultiIndex separately and using reindex would probably be more efficient. You can use `MultiIndex.from_tuples`, for example. I think it would be better if you posted a new question with a sample DataFrame. – ayhan Oct 27 '18 at 21:30
  • I searched for many hours trying to convert a DataFrame index freq from None to H. This worked! Thanks. `dates = x.set_index('dt').resample('H').asfreq().index` – Kip Mar 20 '19 at 19:54
  • This doesn't work when we have duplicate values of 'dt' in different groups, e.g. 2016-01-01 present for both user a and b. The answer below (@piRSquared) addresses that as well; it's a one-liner with stack and unstack. – user2458552 Oct 09 '20 at 16:33
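For the per-user date-range variant raised in the comments, a minimal sketch (assuming each user should only be expanded over that user's own min/max dates, not the global range) could iterate over the groups and reindex each one separately:

```python
import pandas as pd

x = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                  'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
                  'val': [1, 33, 2, 1]})
x['dt'] = pd.to_datetime(x['dt'])

pieces = []
for user, g in x.groupby('user'):
    # build a daily range covering only this user's own dates
    rng = pd.date_range(g['dt'].min(), g['dt'].max(), freq='D')
    filled = (g.set_index('dt')[['val']]
                .reindex(rng, fill_value=0)
                .rename_axis('dt')
                .reset_index())
    filled.insert(1, 'user', user)
    pieces.append(filled)

out = pd.concat(pieces, ignore_index=True)
```

With this sample data, user a is expanded only over 2016-01-01/02 and user b only over 2016-01-05/06, so no zero rows are added; the per-group reindex only matters when a user has gaps inside their own range.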

As @ayhan suggests, first convert the dates to datetime:

x.dt = pd.to_datetime(x.dt)

A one-liner using mostly @ayhan's ideas, incorporating stack/unstack and fill_value:

x.set_index(
    ['dt', 'user']
).unstack(
    fill_value=0
).asfreq(
    'D', fill_value=0
).stack().sort_index(level=1).reset_index()

           dt user  val
0  2016-01-01    a    1
1  2016-01-02    a   33
2  2016-01-03    a    0
3  2016-01-04    a    0
4  2016-01-05    a    0
5  2016-01-06    a    0
6  2016-01-01    b    0
7  2016-01-02    b    0
8  2016-01-03    b    0
9  2016-01-04    b    0
10 2016-01-05    b    2
11 2016-01-06    b    1
piRSquared
    This solution is better than the accepted answer. It works even if there is a duplicate date across users. For example, if there were another row with "2016-01-01" and user "b", the accepted solution would fail with an error (cannot reindex from a duplicate axis), but this solution works for that scenario too – user131476 Aug 17 '20 at 04:29
  • How will this change if we need to do it using the min/max dates within the group ('user' in this case)? Basically generating/filling continuous dates within the group only. – Vivek Payasi Jan 24 '21 at 15:48
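As a quick check of the duplicate-date point raised above, here is a sketch (with an assumed extra row giving user b a value of 7 on 2016-01-01, a date user a already has) showing that the stack/unstack one-liner still works when a date appears in more than one group:

```python
import pandas as pd

# same data as the question, plus an assumed duplicate date across users:
# 2016-01-01 now appears for both user a and user b
y = pd.DataFrame({'user': ['a', 'a', 'b', 'b', 'b'],
                  'dt': ['2016-01-01', '2016-01-02', '2016-01-05',
                         '2016-01-06', '2016-01-01'],
                  'val': [1, 33, 2, 1, 7]})
y['dt'] = pd.to_datetime(y['dt'])

# (dt, user) pairs are still unique, so unstack succeeds where a
# plain reindex on a duplicated 'dt' index would raise
out = (y.set_index(['dt', 'user'])
        .unstack(fill_value=0)
        .asfreq('D', fill_value=0)
        .stack()
        .sort_index(level=1)
        .reset_index())
```

The key point is that the index only needs the (dt, user) *pairs* to be unique, which they are even when the same date occurs for several users.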

An old question with already excellent answers; this is an alternative using the complete function from pyjanitor, which can help with abstraction when explicitly generating missing rows:

# pip install pyjanitor
import pandas as pd
import janitor as jn

x['dt'] = pd.to_datetime(x['dt'])

# generate the complete list of dates
dates = dict(dt=pd.date_range(x.dt.min(), x.dt.max(), freq='1D'))

# build the new dataframe, and fill nulls with 0
x.complete('user', dates, fill_value=0)

   user         dt  val
0     a 2016-01-01    1
1     a 2016-01-02   33
2     a 2016-01-03    0
3     a 2016-01-04    0
4     a 2016-01-05    0
5     a 2016-01-06    0
6     b 2016-01-01    0
7     b 2016-01-02    0
8     b 2016-01-03    0
9     b 2016-01-04    0
10    b 2016-01-05    2
11    b 2016-01-06    1

sammywemmy