0

I have a dataframe, full of hourly data, which has missing values. The dates act as the index and are laid out as yyyy-mm-dd hh:mm.

For the context I'm working in, it isn't appropriate to simply mirror the value above. Hence ffill won't suffice. It would be better to mirror the values for the same hour from the day before.

So if 10:00 the day before has a value of "red", the missing data is filed with the value of "red".

If anybody can help me do this, they will make my day! :)

Date Time          |        Yeovilton
01/01/2012 00:00   |           12.4
01/01/2012 01:00   |           11.7
...
...
02/01/2012 00:00   |           5.9
01/01/2012 01:00   |           NaN
kosherdiah
  • 33
  • 8

1 Answers1

1

Group your data by hour and fill on the groups:

ts.groupby(ts.index.hour).fillna(method='ffill')

Your problem is that, as you point out, ffill operates sequentially, and your data aren't in the sequence you want to fill with. But since your index is already a timestamp, you can extract the hour pretty easily, group with it, and fill inside the groups.

To demonstrate that this works (and show how you could make sample data for this):

import pandas as pd
import numpy as np

timestamps = [pd.Timestamp(t) for t in ['2011-01-01 10:00:00', '2011-01-01 12:00:00', '2011-01-02 10:00:00']]
colors = ['red', 'blue', np.nan]
ts = pd.Series(colors, index=timestamps)

print ts

# 2011-01-01 10:00:00     red
# 2011-01-01 12:00:00    blue
# 2011-01-02 10:00:00     NaN
# dtype: object

print ts.ffill()

# 2011-01-01 10:00:00     red
# 2011-01-01 12:00:00    blue
# 2011-01-02 10:00:00    blue
# dtype: object

print ts.groupby(ts.index.hour).ffill()

# 2011-01-01 10:00:00     red
# 2011-01-01 12:00:00    blue
# 2011-01-02 10:00:00     red
# dtype: object
ASGM
  • 11,051
  • 1
  • 32
  • 53
  • Thanks :) I'm struggling on the best way to include the sample data. Without markdown for tables, 48 hours of data is going to upload as a mess. Have you any suggestions? – kosherdiah Jul 18 '17 at 09:47
  • As a general tip, I'd suggest limiting your sample data to the bare minimum necessary to replicate the problem. In your case, you could get by with four or five rows where the index is a timestamp which has at least two non-sequential rows with the same hour, and where the second of those rows has a missing value. – ASGM Jul 18 '17 at 09:50
  • @kosherdiah Here's a good thread on generating sample data: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – ayhan Jul 18 '17 at 09:52
  • That looks great, ASGM - thanks! Something's just come up, but I'll check if this works for me and accept your answer once I come back. You may have just made my day :) – kosherdiah Jul 18 '17 at 10:06
  • I'm adding something which helped me, hopefully it will come in handy for someone else too. I already had a column full of dates, which I needed to convert to datetime objects. I used: `df["Date Time"] = pd.to_datetime(df["Date Time"])` to convert the column. The column was called "Date Time" in my dataframe – kosherdiah Jul 25 '17 at 10:58