Here are the steps I've taken so far. I am trying to get the daily averages of the 'PM' column in my dataframe.
import pandas as pd
import numpy as np
df_2018 = pd.read_csv('kath2018.csv')
My 'kath2018.csv' looks like this:
df_2018.head()
Date Year Month Day Hour PM
0 1/1/18 1:00 2018 1 1 1 131
1 1/1/18 2:00 2018 1 1 2 85
2 1/1/18 3:00 2018 1 1 3 74
3 1/1/18 4:00 2018 1 1 4 79
4 1/1/18 5:00 2018 1 1 5 85
I clean up the data by replacing the sentinel values (-999 and 985) with np.nan, and then use interpolate to fill in the NaNs.
# data has stray -999 and 985 sentinel values; replace them with NaN
df_2018['PM'] = df_2018['PM'].replace([-999, 985], np.nan)
df_2018['PM'] = df_2018['PM'].interpolate()
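As an aside, both sentinel values can be handled in one replace call by passing a list. A minimal sketch on made-up toy data (the values here are hypothetical, not from kath2018.csv):

```python
import pandas as pd
import numpy as np

# Toy series containing the same sentinel values as the real data
s = pd.Series([131.0, -999.0, 74.0, 985.0, 85.0])

# Replace both sentinels in a single call, then fill by linear interpolation
s = s.replace([-999, 985], np.nan)
s = s.interpolate()
# The interior NaNs are now the midpoints of their neighbours
```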
Then, to get the daily average (my data is given in hourly intervals), I run the following code, which does exactly what it is supposed to: it groups the hourly values by day and averages them.
df_2018['Date'] = pd.to_datetime(df_2018['Date'])
df_2018 = df_2018.groupby(pd.Grouper(freq='D', key='Date')).mean()
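For what it's worth, grouping with pd.Grouper(freq='D') already emits a row for every calendar day in the range, and the mean of a day with no rows comes out as NaN. A minimal sketch on toy data (the dates and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hourly rows for Jan 1 and Jan 3 only; Jan 2 is entirely absent
dates = pd.to_datetime(['2018-01-01 01:00', '2018-01-01 02:00', '2018-01-03 05:00'])
df = pd.DataFrame({'Date': dates, 'PM': [10.0, 20.0, 40.0]})

# Daily grouping still creates a bin for the absent day
daily = df.groupby(pd.Grouper(freq='D', key='Date')).mean()
# Jan 2 appears in the result with PM == NaN
```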
However, there are entire days' worth of missing data: when I look at df_2018 now, the completely missing days show up with an empty cell under the PM column (see: current dataframe after groupby). I cannot figure out how to go back into the dataframe and replace those empty cells with np.nan so that I can run the interpolation again.
Should I be 'going back' at all, or is there a way for me to scope out the missing days first, before running the interpolation and groupby?
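One possible approach, assuming the post-groupby "empty" cells are ordinary NaNs (which is what mean returns for a day with no rows): the missing days can be listed with isna(), and interpolate can simply be run a second time on the daily frame. A sketch on hypothetical toy data:

```python
import pandas as pd
import numpy as np

# Toy daily frame, shaped like the groupby output; Jan 2 is fully missing
idx = pd.date_range('2018-01-01', periods=4, freq='D')
daily = pd.DataFrame({'PM': [15.0, np.nan, 40.0, 50.0]}, index=idx)

# Scope out the fully missing days first
missing = daily.index[daily['PM'].isna()]

# The "empty" cells are already NaN, so a second interpolate fills them
daily['PM'] = daily['PM'].interpolate()
```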