I read from a remote device some data every 5 seconds.
They are saved as:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
2018-01-01 00:00:15 2
2018-01-01 00:00:20 3
2018-01-01 00:00:25 4
2018-01-01 00:00:30 3
2018-01-01 00:00:35 2
2018-01-01 00:00:40 4
2018-01-01 00:00:45 5
2018-01-01 00:00:50 3
2018-01-01 00:00:55 3
Alas communication is not the best, and sometimes communication is not working.
In that case the remote device will provide the cumulative value of the data as soon as possible.
The previous data could be saved as:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
.......... 00:00:15 missing...
.......... 00:00:20 missing...
.......... 00:00:25 missing...
2018-01-01 00:00:30 12 <--- sum of the last 4 readings
2018-01-01 00:00:35 2
.......... 00:00:40 missing...
.......... 00:00:45 missing...
2018-01-01 00:00:50 15 <--- sum of the last 3 readings
2018-01-01 00:00:55 3
I need to fill all missing rows and remove the peaks in the original data with the mean value calculated on the peak.
Resampling is easy:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
2018-01-01 00:00:15 NaN
2018-01-01 00:00:20 NaN
2018-01-01 00:00:25 NaN
2018-01-01 00:00:30 12
2018-01-01 00:00:35 2
2018-01-01 00:00:40 NaN
2018-01-01 00:00:45 NaN
2018-01-01 00:00:50 15
2018-01-01 00:00:55 3
but how to fill NaN and remove the peaks?
I checked the various methods for asfreq
and resample
but none of them (bfill
, ffill
) is useful in this case.
The final result should be:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
2018-01-01 00:00:15 3 <--- NaN filled with mean = peak 12/4 rows
2018-01-01 00:00:20 3 <--- NaN filled with mean
2018-01-01 00:00:25 3 <--- NaN filled with mean
2018-01-01 00:00:30 3 <--- peak changed
2018-01-01 00:00:35 2
2018-01-01 00:00:40 5 <--- NaN filled with mean = peak 15/3 rows
2018-01-01 00:00:45 5 <--- NaN filled with mean
2018-01-01 00:00:50 5 <--- peak changed
2018-01-01 00:00:55 3
Dataframes I used for testing:
import numpy as np
import pandas as pd
time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
read_data = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2, np.nan, np.nan, 15, 3], index=time).dropna()
read_data.asfreq("5s")