237

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.

idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()

In the above code idx becomes a range of say 30 dates. 09-01-2013 to 09-30-2013 However S may only have 25 or 26 days because no events happened for a given date. I then get an AssertionError as the sizes dont match when I try to plot:

fig, ax = plt.subplots()    
ax.bar(idx.to_pydatetime(), s, color='green')

What's the proper way to tackle this? Do I want to remove dates with no values from IDX or (which I'd rather do) is add to the series the missing date with a count of 0. I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?

Here's a snippet of S ( df.groupby(['simpleDate']).size() ), notice no entries for 04 and 05.

09-02-2013     2
09-03-2013    10
09-06-2013     5
09-07-2013     1
KHibma
  • 2,927
  • 4
  • 20
  • 25

7 Answers7

410

You could use Series.reindex:

import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
print(s)

yields

2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
...
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 39
    `reindex` is an amazing function. It can (1) reorder existing data to match a new set of labels, (2) insert new rows where no label previously existed, (3) fill data for missing labels, (including by forward/backward filling) (4) select rows by label! – unutbu Oct 11 '13 at 18:36
  • @unutbu This answers part of a question I had too, thanks! But was wondering if you knew how to dynamically create a list of with the dates which have events? – Nick Duddy Jun 05 '17 at 12:08
  • 2
    There is one problem (or bug) with reindex though: it doesn't work with dates before 1/1/1970, so in this case df.resample() works perfectly. – Sergey Gulbin Jul 27 '17 at 04:35
  • 5
    you may use this instead for idx to skip entering start and end dates manually: `idx = pd.date_range(df.index.min(), df.index.max())` – Reveille Apr 03 '20 at 15:22
  • 2
    Dropping the link to the documentation here, to save you the search: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html – Harm Jul 08 '20 at 14:57
  • is there a way to keep the index's name when running this function? – user147529 Aug 06 '20 at 09:42
  • for `df.reindex` with multiple index levels you can use `pd.MultiIndex.from_product` see https://stackoverflow.com/questions/53286882/how-to-reindex-a-multiindex-dataframe – citynorman Sep 12 '21 at 18:36
  • is it possible to add the 'interpolation' method while doing this? – Nihat Oct 26 '21 at 17:10
  • 1
    reindex does not work at least anymore – intergallactic Nov 23 '22 at 10:22
94

A quicker workaround is to use .asfreq(). This doesn't require creation of a new index to call within .reindex().

# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'), 
                  pd.Timestamp('2012-05-04'), 
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)

print(s.asfreq('D'))
2012-05-01    1.0
2012-05-02    NaN
2012-05-03    NaN
2012-05-04    2.0
2012-05-05    NaN
2012-05-06    3.0
Freq: D, dtype: float64
Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
  • 3
    I really prefer this method; you avoid having to call `date_range` since it implicitly uses the first and last index as the start and end (which is what you would almost always want). – Michael Hays Oct 02 '18 at 20:31
  • Very clean and professional method. Works well with using interpolate afterwards as well. – msarafzadeh Jul 12 '20 at 09:08
  • I second this. This is also a great method to use before merging two dataframes of different index length where joins, merges etc. almost always leads to errors such as a column full of NaNs. – user3661992 Sep 07 '20 at 10:24
  • 2
    Thanks for your answer but I still have a question. Given that I want to start on date x-x-x and end on date y-y-y and on my dataset 's' I have dates e-e-e to f-f-f, that are between dates x-x-x and y-y-y. Using "asfreq" how can I fill the dates on my dataset 's' from x-x-x to y-y-y? I have not found on the docs. Thank you – Catarina Nogueira May 28 '21 at 14:59
  • Yes I used this method to insert NaN in missing dates in a dataframe before plotting it with matplatlib – PerseP Jan 05 '23 at 09:59
38

An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:

df.resample('D').mean()

resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.

Here is the original data, but with an extra entry for '2013-09-03':

             val
date           
2013-09-02     2
2013-09-03    10
2013-09-03    20    <- duplicate date added to OP's data
2013-09-06     5
2013-09-07     1

And here are the results:

             val
date            
2013-09-02   2.0
2013-09-03  15.0    <- mean of original values for 2013-09-03
2013-09-04   NaN    <- NaN b/c date not present in orig
2013-09-05   NaN    <- NaN b/c date not present in orig
2013-09-06   5.0
2013-09-07   1.0

I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.

JohnE
  • 29,156
  • 8
  • 79
  • 109
37

One issue is that reindex will fail if there are duplicate values. Say we're working with timestamped data, which we want to index by date:

df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
    'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-18  "2016-11-18 04:00:00"  d

Due to the duplicate 2016-11-16 date, an attempt to reindex:

all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)

fails with:

...
ValueError: cannot reindex from a duplicate axis

(by this it means the index has duplicates, not that it is itself a dup)

Instead, we can use .loc to look up entries for all dates in range:

df.loc[all_days]

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-17  NaN                    NaN
2016-11-18  "2016-11-18 04:00:00"  d

fillna can be used on the column series to fill blanks if needed.

Nick Edgar
  • 1,269
  • 1
  • 10
  • 7
  • Any idea on what to do if Date column contains `Blanks` or `NULLS`? `df.loc[all_days]` won't work in that case. – Furqan Hashim Feb 23 '19 at 15:44
  • 2
    Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. See the documentation here: https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike – Dmitrii Magas Mar 05 '19 at 11:53
10

Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:

def fill_in_missing_dates(df, date_col_name = 'date',date_order = 'asc', fill_value = 0, days_back = 30):

    df.set_index(date_col_name,drop=True,inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    d = datetime.now().date()
    d2 = d - timedelta(days = days_back)
    idx = pd.date_range(d2, d, freq = "D")
    df = df.reindex(idx,fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)

    return df
Midavalo
  • 469
  • 2
  • 18
  • 29
eiTan LaVi
  • 2,901
  • 24
  • 15
2

You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.

# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
    'date':pd.to_datetime([
        '2022-02-10'
        ,'2022-02-11'
        ,'2022-02-14'
        ,'2022-02-14'
        ,'2022-02-24'
        ,'2022-02-16'
    ])
    ,'value':[10,20,5,10,15,30]
})

# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])

# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
Sylvester Kruin
  • 3,294
  • 5
  • 16
  • 39
1
s.asfreq('D').interpolate().asfreq('Q')
thistleknot
  • 1,098
  • 16
  • 38