my dataframe looks like this:
Date Name Speed DOW Year m d hour
64603 06/02/18 NB_MID NaN Tuesday 2018 2 6 0
64604 06/02/18 NB_MID NaN Tuesday 2018 2 6 0
64608 06/02/18 NB_MID 43.600 Tuesday 2018 2 6 0
64610 06/02/18 NB_MID 4.630 Tuesday 2018 2 6 0
64616 06/02/18 NB_MID 24.633 Tuesday 2018 2 6 0
... ... ... ... ... ... ... ... ...
503754 27/02/18 NB_MID 85.088 Tuesday 2018 2 27 23
503756 27/02/18 NB_MID 15.982 Tuesday 2018 2 27 23
503761 27/02/18 NB_MID 51.725 Tuesday 2018 2 27 23
503762 27/02/18 NB_MID 3.016 Tuesday 2018 2 27 23
503763 27/02/18 NB_MID 3.833 Tuesday 2018 2 27 23
The 'Speed' column has some missing values. To replace the missing SPEED values, I want to use a median calculated for the particular hour of the day when the missing value occurs. This median value would then be used to replace every NaN for that particular hour of the day.
There is 24 hours in the day (0-24) and therefore it makes sense to calculate 24 different median values.
I can use:
df_median = df["Speed"].median()
to find a median for the whole dataframe.
And therefore I could use something like this:
df1 = df.loc[df['hour'] == 1]
df_median = df1["Speed"].median()
df_median
OUTPUT: 40.56
I could keep repeating this code for every hour of the day, but the way I would do it would be to create a new dataframe each time and change the hour each time... like this:
df2 = df.loc[df['hour'] == 2]
df_median2 = df2["Gap"].median()
df_median2
Surely there is a better way to do it? Also, what would then be the best way to replace each NaN with the correct median? For example, when 'hour' column = 1 replace all NaN 'Speed' values with 40.56
I hope I have explained this well enough. I think the way I am doing this could be improved. Thanks for any help.