Resampling dataframe in pandas as a checking operation

Question

I have a DataFrame like this:

                            A        B    value
2014-11-14 12:00:00      30.5    356.3      344
2014-11-15 00:00:00      30.5    356.3      347
2014-11-15 12:00:00      30.5    356.3      356
2014-11-16 00:00:00      30.5    356.3      349
...
2017-01-06 00:00:00      30.5    356.3      347

I want to check if the index is running every 12 hours, perhaps there is some data missing, so there can be a jump of 24 or more hours. In that case I want to introduce nan in the value column and copy the values from columns A and B.

I thought of using resample:

df = df.resample('12H')

but I don't know how to handles the different columns or if this is the right approach.

EDIT: If there is a value missing, for instance in 2015-12-12 12:00:00 I would like to add a row like this:

...
2015-12-12 00:00:00     30.5    356.3    323
2015-12-12 12:00:00     30.5    356.3    NaN  *<- add this*
2015-12-13 00:00:00     30.5    356.3    347
...

Sorry are you after `df = df.resample('12H').ffill()`? can you show what the desired result should look like — EdChum, Jan 11 '17 at 15:03

Ted Petrou · Accepted Answer · 2017-01-11T15:21:42.670

2

You can use the asfreq method to produce evenly spaced indexes every 12 hours which will automatically put np.nan values for every jump. Then you can just forward fill columns A and B.

df1= df.asfreq('12H')
df1[['A','B']] = df1[['A','B']].fillna(method='ffill')

edited Jan 11 '17 at 15:21

answered Jan 11 '17 at 15:15

Ted Petrou

59,042
19
131
136

With `asfreq` it works, as you said, but with `df1 = df.resample('12H')` is not working, and I think it should be the same. Am I missing something? (I get the `ValueError: cannot set items on DatetimeIndexResampler`. – David Jan 12 '17 at 10:33
1

`resample` only creates a resample object. It's very analogous to what `groupby` does. You must act on that object to produce a dataframe. Like `df.resample('12H').sum()' or something similar. – Ted Petrou Jan 12 '17 at 12:03

score 1 · Answer 2 · edited May 23 '17 at 11:45

I would go for simply sorting your dataframe on the index and create a new column that takes the value from the next row (for the time). The current time would be called "from" and the time from the next time would be called "to".

Next step would be to use the two columns ("from" and "to") to create a column containing a list of values between this row and next row for every 12 hours (a range basically).

Final step would be to "explode" every line for each value in the range. Look at How to explode a list inside a Dataframe cell into separate rows

Hope this helps :)

Resampling dataframe in pandas as a checking operation

2 Answers2

Linked