4

I have data that is every 15 seconds. But, there are some values that are missing. These are not tagged with NaN, but simply are not present. How can I fill in those values?
I have tried to resample, but that also shifts my original data. So, why doesn't this work:

a=pd.Series([1.,3.,4.,3.,5.],['2016-05-25 00:00:35','2016-05-25 00:00:50','2016-05-25 00:01:05','2016-05-25 00:01:35','2016-05-25 00:02:05'])                                   
a.index=pd.to_datetime(a.index)
a.resample('15S').mean()

In [368]: a
Out[368]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:35    3.0
2016-05-25 00:02:05    5.0
dtype: float64

It shows me this:

2016-05-25 00:00:30    1.0
2016-05-25 00:00:45    3.0
2016-05-25 00:01:00    4.0
2016-05-25 00:01:15    NaN
2016-05-25 00:01:30    3.0
2016-05-25 00:01:45    NaN
2016-05-25 00:02:00    5.0
Freq: 15S, dtype: float64

So, I no longer have a value at 00:35 or 00:50.
For my original larger data set, I also end up seeing many NaN value in large groups at the end of the resampled data.
What I would like to do resample my 15s data, to 15s, so whenever there is no data present for a particular time it should use the mean of the values around it to fill it in. Is there a way to do that?
Also, why does the time basis change when I resample? My original data starts at 00:00:35 and after resampling it starts at 00:30? It seems like it got shifted by 5 seconds.
In my example data, all it should have done is created an additional data entry at 00:01:50.


Edit

I realized that my data is slightly more complex then I had thought. The 'base' actually changes part way through it. If I use the solution below, then it works for part of the data, but then the values stop changing. For example:

a = pd.Series([1.,3.,4.,3.,5.,6.,7.,8.], ['2016-05-25 00:00:35','2016-05-25 00:00:50','2016-05-25 00:01:05','2016-05-25 00:01:35','2016-05-25 00:02:05','2016-05-25 00:03:00','2016-05-25 00:04:00','2016-05-25 00:06:00'])                                   

In [79]: a
Out[79]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:35    3.0
2016-05-25 00:02:05    5.0
2016-05-25 00:03:00    6.0
2016-05-25 00:04:00    7.0
2016-05-25 00:06:00    8.0
dtype: float64

In [80]: a.index = pd.to_datetime(a.index)

In [81]: a.resample('15S', base=5).interpolate()
Out[81]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:20    3.5
2016-05-25 00:01:35    3.0
2016-05-25 00:01:50    4.0
2016-05-25 00:02:05    5.0
2016-05-25 00:02:20    5.0
2016-05-25 00:02:35    5.0
2016-05-25 00:02:50    5.0
2016-05-25 00:03:05    5.0
2016-05-25 00:03:20    5.0
2016-05-25 00:03:35    5.0
2016-05-25 00:03:50    5.0
2016-05-25 00:04:05    5.0
2016-05-25 00:04:20    5.0
2016-05-25 00:04:35    5.0
2016-05-25 00:04:50    5.0
2016-05-25 00:05:05    5.0
2016-05-25 00:05:20    5.0
2016-05-25 00:05:35    5.0
2016-05-25 00:05:50    5.0
Freq: 15S, dtype: float64

As you can see it stops interpolating after 2:05, and seems to ignore the data at 3:00,4:00 and 5:00.

IanS
  • 15,771
  • 9
  • 60
  • 84
Adam
  • 742
  • 1
  • 6
  • 23
  • 1
    You should use the `base` argument of `resample`. – IanS Sep 20 '16 at 16:29
  • Can one of you help explain "base"? What are the units? I see that using base=5 does the job, but i've read the resample docs, and it really is not clear why 5 is correct and not 35 for example? As well as loffset? – Adam Sep 20 '16 at 17:10
  • This looks similar to what I have been trying to do. But is complex and inefficient for large data: https://stackoverflow.com/questions/25234941/python-regularise-irregular-time-series-with-linear-interpolation – Adam Sep 21 '16 at 18:07
  • 1
    But then it's not clear what you want. Can you show your desired output is with the new example? Ideally you would ask a new question, as this one has become too confused. You should re-accept Alberto'/s answer, and ask a new question with your new example and your desired output. – IanS Sep 22 '16 at 07:51
  • I would like to return an exact value when the input time matches the output time, and an interpolated value when the output time is between two input times. So 2:05 should be 5.0 but 2:20 should be about 5.25. and 2:35 should be about 5.5 (these are just approx, not the exact answer). – Adam Sep 22 '16 at 13:06
  • To put it another way, I have somewhat irregular data. I want Pandas to give me a "best guess" every 15 second. Clearly, the best answer is the exact one, when the times match up. Otherwise, I want to use interpolate to that time stamp for a best guess (although, there are other reasonable algorithms that could be used as well) – Adam Sep 22 '16 at 13:22
  • So you want your final timestamps to be regular, i.e. every 15 seconds? – IanS Sep 22 '16 at 14:21
  • Yes, that's exactly what I'm trying to do. With a local interpolation, when necessary, and the exact point, when it is available. – Adam Sep 22 '16 at 19:21
  • Weird, `a.resample('15S', base=5).interpolate()` works for me. Which pandas version are you using? – IanS Sep 23 '16 at 08:21
  • 1
    Note that the code above does the resampling first, then the interpolation, which means that the value at `2:50` will be exactly the original value at `3:00`. If that's not what you want, then you will probably have to upsample to 5-seconds frequency first, interpolate, then downsample back to 15-seconds. Again, I recommend you ask a new question. – IanS Sep 23 '16 at 08:23
  • @IanS I'm using 0.18.1 and I get the results shown above. Which version are you using, and what results do you get? – Adam Sep 23 '16 at 13:20
  • Good point, I use version 0.18.0, and with 0.18.1 I see the same result as you. You should report a bug here: https://github.com/pydata/pandas/issues (I don't think it has been reported before). – IanS Sep 23 '16 at 13:38
  • 1
    @IanS Done: https://github.com/pydata/pandas/issues/14297 – Adam Sep 25 '16 at 15:42

4 Answers4

5

Both @IanS and @piRSquared address the shifting of the base. As for filling NaNs: pandas has methods for forward-filling (.ffill()/.pad()) and backward-filling (.bfill()/.backfill()), but not for taking the mean. A quick way of doing it is by taking the mean manually:

b = a.resample('15S', base=5)
(b.ffill() + b.bfill()) / 2

Output:

2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:20    3.5
2016-05-25 00:01:35    3.0
2016-05-25 00:01:50    4.0
2016-05-25 00:02:05    5.0
Freq: 15S, dtype: float64

EDIT: I stand corrected: there is a built-in method: .interpolate().

a.resample('15S', base=5).interpolate()
Alicia Garcia-Raboso
  • 13,193
  • 1
  • 43
  • 48
  • Good point. This [answer](http://stackoverflow.com/a/18691949/5276797) should work too: `df.fillna(df.mean())`. – IanS Sep 20 '16 at 16:45
  • Can one of you help explain "base"? What are the units? I see that using base=5 does the job, but i've read the resample docs, and it really is not clear why 5 is correct and not 35 for example? As well as loffset? – Adam Sep 20 '16 at 17:49
  • In this example, `base` is an integer between 0 and 14 that makes the timeseries begin at midnight past `base` seconds. The argument itself is an integer, but the units are implicitly the frequency's unit. – IanS Sep 21 '16 at 08:50
  • @Adam (not sure if you'd be notified otherwise) – IanS Sep 21 '16 at 08:50
  • @IanS it turns out that this solution only works for part of the data. So, I expanded the question some. – Adam Sep 21 '16 at 17:48
3

For the sake of completeness, the base argument works too:

a.resample('15S', base=5).mean()
Out[4]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:20    NaN
2016-05-25 00:01:35    3.0
2016-05-25 00:01:50    NaN
2016-05-25 00:02:05    5.0
Freq: 15S, dtype: float64
IanS
  • 15,771
  • 9
  • 60
  • 84
2

you need to use the loffset argument

a.resample('15S', loffset='5S')

enter image description here

piRSquared
  • 285,575
  • 57
  • 475
  • 624
0

An answer was posted to my bug report that I wanted to share here for completeness. It is not my post, but does just what I had wanted:

Try this (maybe this is what interpolate should do by default, interpolating before re-sampling?)

from scipy.interpolate import interp1d

# fit the interpolation in integer ns-space
f = interp1d(a.index.asi8, a.values)

# generating ending bins
dates = a.resample('15s', base=5).first().index

# apply
pd.Series(f(dates.asi8), dates)
Out[122]: 
2016-05-25 00:00:35    1.000000
2016-05-25 00:00:50    3.000000
2016-05-25 00:01:05    4.000000
2016-05-25 00:01:20    3.500000
2016-05-25 00:01:35    3.000000
2016-05-25 00:01:50    4.000000
2016-05-25 00:02:05    5.000000
2016-05-25 00:02:20    5.272727
2016-05-25 00:02:35    5.545455
2016-05-25 00:02:50    5.818182
2016-05-25 00:03:05    6.083333
2016-05-25 00:03:20    6.333333
2016-05-25 00:03:35    6.583333
2016-05-25 00:03:50    6.833333
2016-05-25 00:04:05    7.041667
2016-05-25 00:04:20    7.166667
2016-05-25 00:04:35    7.291667
2016-05-25 00:04:50    7.416667
2016-05-25 00:05:05    7.541667
2016-05-25 00:05:20    7.666667
2016-05-25 00:05:35    7.791667
2016-05-25 00:05:50    7.916667
Freq: 15S, dtype: float64
Adam
  • 742
  • 1
  • 6
  • 23