11

I have Pandas DataFrame (loaded from .csv) with Date-time as index.. where there is/have-to-be one entry per day. The problem is that I have gaps i.e. there is days for which I have no data at all. What is the easiest way to insert rows (days) in the gaps ? Also is there a way to control what is inserted in the columns as data ! Say 0 OR copy the prev day info OR to fill sliding increasing/decreasing values in the range from prev-date toward next-date data-values.

thanks

Here is example 01-03 and 01-04 are missing :

In [60]: df['2015-01-06':'2015-01-01']
Out[60]: 
           Rate  High (est)  Low (est)
Date                                      
2015-01-06  1.19643      0.0000     0.0000
2015-01-05  1.20368      1.2186     1.1889
2015-01-02  1.21163      1.2254     1.1980
2015-01-01  1.21469      1.2282     1.2014

Still experimenting but this seems to solve the problem :

df.set_index(pd.DatetimeIndex(df.Date),inplace=True)

and then resample... the reason being that importing the .csv with header-col-name Date, is not actually creating date-time-index, but Frozen-list whatever that means. resample() is expecting : if isinstance(ax, DatetimeIndex): .....


Here is my final solution :

  #make dates the index
  self.df.set_index(pd.DatetimeIndex(self.df.Date), inplace=True)
  #fill the gaps
  self.df = self.df.resample('D',fill_method='pad')
  #fix the Date column
  self.df.Date = self.df.index.values

I had to fix the Date column, because resample() just allow you to pad-it. It fixes the index correctly though, so I could use it to fix the Date column.

Here is snipped of the data after correction :

2015-01-29 2015-01-29  1.13262      0.0000     0.0000
2015-01-30 2015-01-30  1.13161      1.1450     1.1184
2015-01-31 2015-01-31  1.13161      1.1450     1.1184
2015-02-01 2015-02-01  1.13161      1.1450     1.1184

01-30, 01-31 are the new generated data.

sten
  • 7,028
  • 9
  • 41
  • 63
  • 1
    df.index should be a `DatetimeIndex`, I recommend using `pd.to_datetime` rather than the pd.DatetimeIndex constructor. `df.index.names` is the "index header" (the columns are labelled, e.g. "Rate", so can the levels of the index). – Andy Hayden Feb 12 '15 at 01:41
  • can I do that during cvs import (pd.read_csv()) – sten Feb 12 '15 at 02:35
  • Yes, `parse_dates=True` may do the trick. – Andy Hayden Feb 12 '15 at 05:45
  • Your first example shows that 01/03 and 01/04 were missing, but your "data after correction" shows a completely different range, claiming that 1/30 and 1/31 were newly generated. Seems like you would want to show the same time period before and after. – nategood Apr 17 '15 at 16:59
  • Update: `fill_method="pad"` has since been deprecated. Use `.resample(...).pad()` instead – Manto Aug 05 '18 at 01:50

1 Answers1

14

You'll could resample by day e.g. using mean if there are multiple entries per day:

df.resample('D', how='mean')

You can then ffill to replace NaNs with the previous days result.

See up and down sampling in the docs.

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Now you've updated with ohlc, I'm not sure if you can how='ohlc' on an already ohlc DataFrame (there is an issue about that though), you can resample each column individually (i.e. high with how='max', low with how='min', rate with how='mean'). That's if you have multiple entries for a day, if you don't its all good. – Andy Hayden Feb 11 '15 at 23:50
  • Thanks.. I'm getting : TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex ?? when I try it ... df.index.names gives back : FrozenList([u'Date']) , if it means something .. – sten Feb 12 '15 at 00:58
  • 2
    @user1019129 your index is not a DatetimeIndex, you can make it so with `df.index = pd.to_datetime(df.index)`. – Andy Hayden Feb 12 '15 at 01:38
  • 1
    Related. I was looking to fill daily gaps and interpret them as 0s (e.g. there was no data to collect that day). `fillna` does the trick. `df2 = df.resample('D', how='mean').fillna(0)` – nategood Apr 17 '15 at 17:09