Some data are collected every 5 seconds, and some samples are occasionally missing.

After loading them into a pandas DataFrame, I want to define a starting point in time and extract exactly 180 rows (15 minutes x 12 samples per minute), whatever the starting point. These data feed a plot, and always keeping the same size greatly simplifies the rest of the code.

The missing data should be filled with None.

I assume there are shortcuts to do this that I am not aware of:

import pandas as pd
import datetime

dt = [
    "2018-02-08 13:45:05",
    "2018-02-08 13:45:10",
    "2018-02-08 13:45:25",
    "2018-02-08 13:45:30",
    "2018-02-08 13:45:35",
    "2018-02-08 13:45:40",
    "2018-02-08 13:45:50",
    "2018-02-08 13:45:55",
    "2018-02-08 13:46:00",
    "2018-02-08 13:46:05",
]

wl = [
    4737.25,
    4834.80,
    4885.53,
    5003.98,
    5031.08,
    5215.90,
    5147.65,
    5100.50,
    5038.94,
    5020.67,
]

df = pd.DataFrame({"dt":dt, "wl":wl}).set_index("dt")
df.index = pd.to_datetime(df.index)
df = df.resample("5s").mean()
print(df)

This returns:

                          wl
dt                          
2018-02-08 13:45:05  4737.25
2018-02-08 13:45:10  4834.80
2018-02-08 13:45:15      NaN
2018-02-08 13:45:20      NaN
2018-02-08 13:45:25  4885.53
2018-02-08 13:45:30  5003.98
2018-02-08 13:45:35  5031.08
2018-02-08 13:45:40  5215.90
2018-02-08 13:45:45      NaN
2018-02-08 13:45:50  5147.65
2018-02-08 13:45:55  5100.50
2018-02-08 13:46:00  5038.94
2018-02-08 13:46:05  5020.67

This is OK, but the datetime range is defined by the datetimes of the first and last samples.

The datetime range I am interested in is instead:

new_datetime_range = pd.date_range(start=df.index.min(), freq="5s", periods=180)
print(new_datetime_range)

which runs from the first sample up to '2018-02-08 14:00:00'.

I am trying to obtain:

                          wl
dt                          
2018-02-08 13:45:05  4737.25
2018-02-08 13:45:10  4834.80
2018-02-08 13:45:15      NaN
2018-02-08 13:45:20      NaN
2018-02-08 13:45:25  4885.53
2018-02-08 13:45:30  5003.98
2018-02-08 13:45:35  5031.08
2018-02-08 13:45:40  5215.90
2018-02-08 13:45:45      NaN
2018-02-08 13:45:50  5147.65
2018-02-08 13:45:55  5100.50
2018-02-08 13:46:00  5038.94
2018-02-08 13:46:05  5020.67
2018-02-08 13:46:10      NaN
2018-02-08 13:46:15      NaN
............................
2018-02-08 13:59:45      NaN
2018-02-08 13:59:50      NaN
2018-02-08 13:59:55      NaN
2018-02-08 14:00:00      NaN

How can I do this?

Alex Poca

1 Answer

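I think you need reindex, which conforms the resampled frame to the full 180-row index and fills the timestamps that have no data with NaN: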

df = df.resample("5s").mean().reindex(new_datetime_range)
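
As a minimal, self-contained sketch of the reindex approach using the sample data from the question: the helper name fixed_window is hypothetical, and it assumes the chosen start lies on the same 5-second grid as the resampled data (otherwise no label would match and every row would be NaN). The last line shows one way to get None instead of NaN, since a float column cannot hold None itself.

```python
import pandas as pd

# Sample data from the question: 5-second readings with gaps.
dt = [
    "2018-02-08 13:45:05", "2018-02-08 13:45:10", "2018-02-08 13:45:25",
    "2018-02-08 13:45:30", "2018-02-08 13:45:35", "2018-02-08 13:45:40",
    "2018-02-08 13:45:50", "2018-02-08 13:45:55", "2018-02-08 13:46:00",
    "2018-02-08 13:46:05",
]
wl = [4737.25, 4834.80, 4885.53, 5003.98, 5031.08,
      5215.90, 5147.65, 5100.50, 5038.94, 5020.67]

df = pd.DataFrame({"wl": wl}, index=pd.to_datetime(dt))


def fixed_window(frame, start, periods=180, freq="5s"):
    """Hypothetical helper: return exactly `periods` rows beginning at `start`.

    Assumes `start` is aligned with the resampling grid.
    """
    target = pd.date_range(start=start, freq=freq, periods=periods)
    # resample() fills gaps between the first and last sample;
    # reindex() then pads (or trims) to the requested window.
    return frame.resample(freq).mean().reindex(target)


window = fixed_window(df, "2018-02-08 13:45:05")

# Plain Python list with None in place of NaN, e.g. for a plotting layer.
values = [None if pd.isna(v) else v for v in window["wl"]]
```

Because the target index is built independently of the data, the result has 180 rows for any aligned starting point, including one earlier than the first sample.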

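Another solution is to add the last date to the index manually, so that resample extends the range up to it:

import numpy as np

last = pd.date_range(start=df.index.min(), freq="5s", periods=180)[-1]
df.loc[last] = np.nan  # overwrites any existing value at that timestamp
df = df.resample("5s").mean()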

print(df)
                          wl
2018-02-08 13:45:05  4737.25
2018-02-08 13:45:10  4834.80
2018-02-08 13:45:15      NaN
2018-02-08 13:45:20      NaN
2018-02-08 13:45:25  4885.53
2018-02-08 13:45:30  5003.98
2018-02-08 13:45:35  5031.08
2018-02-08 13:45:40  5215.90
2018-02-08 13:45:45      NaN
2018-02-08 13:45:50  5147.65
2018-02-08 13:45:55  5100.50
2018-02-08 13:46:00  5038.94
2018-02-08 13:46:05  5020.67
2018-02-08 13:46:10      NaN
2018-02-08 13:46:15      NaN
...
...
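
A possible variant of the second approach, sketched under the assumption that a real reading may already exist at the final timestamp: instead of assigning NaN with .loc, it adds the last timestamp through an index union, so an existing value is kept. The helper name pad_to_end and the 5000.0 reading are made up for illustration.

```python
import pandas as pd


def pad_to_end(frame, periods=180, freq="5s"):
    """Hypothetical helper: extend `frame` so resampling reaches the window end."""
    last = pd.date_range(start=frame.index.min(), freq=freq, periods=periods)[-1]
    # union() only adds `last` if it is not already present,
    # so a real reading at that timestamp is not replaced by NaN.
    extended = frame.reindex(frame.index.union(pd.DatetimeIndex([last])))
    return extended.resample(freq).mean()


# Two readings only: the window end is missing and gets padded with NaN.
short = pd.DataFrame(
    {"wl": [4737.25, 4834.80]},
    index=pd.to_datetime(["2018-02-08 13:45:05", "2018-02-08 13:45:10"]),
)
padded = pad_to_end(short)

# Illustrative case: a (made-up) reading already exists at the window end.
with_end = pd.DataFrame(
    {"wl": [4737.25, 5000.0]},
    index=pd.to_datetime(["2018-02-08 13:45:05", "2018-02-08 14:00:00"]),
)
kept = pad_to_end(with_end)
```

Unlike reindexing against a fixed date_range, this does not trim data that lies beyond the window, so the result can be longer than 180 rows if later samples exist.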
jezrael
    Thank you @jezrael! It works perfectly. I have seen reindex in the past, but never understood how to use it properly. Also the second solution is smart, as long as one takes care that an existing last value is not replaced by NaN. – Alex Poca Feb 08 '18 at 15:09