I have a time series dataset that contains 29,184 hours of data. There are around 1k+ rows of missing values.
Here is the link to the dataset: dropbox
Here is a small preview of the dataset:
NUMBER_OF_VEHICLES
DATE_TIME
2020-01-01 00:00:00 39.0
2020-01-01 01:00:00 3.0
2020-01-01 02:00:00 1.0
2020-01-01 03:00:00 1.0
2020-01-01 04:00:00 2.0
Here is a section of a plot that shows missing values:
The column NUMBER_OF_VEHICLES
contains both hourly and weekly seasonality. I am trying to meaningfully fill in the missing values in a way that captures the hourly and weekly seasonality too. Here are 2 approaches I did:
Approach 1
# Saving index of missing values
missing_ix = df[df['NUMBER_OF_VEHICLES'].isnull()].index
# Create HOUR and DAYOFWEEK features from datetime index
df= (
df.assign(HOUR=lambda x: x.index.hour,
DAYOFWEEK=lambda x: x.index.dayofweek)
)
# Linear interpolation
df['lin_impute'] = df['NUMBER_OF_VEHICLES'].interpolate("linear", limit_direction="both")
Here is the plot of the plot after linear interpolation. The indexes with missing values are shown in red color.
The above approach does not capture the seasonality patterns. The missing data is filled in linearly.
Approach 2
# Saving index of missing values
missing_ix = df[df['NUMBER_OF_VEHICLES'].isnull()].index
# Create HOUR and DAYOFWEEK features from datetime index
df= (
df.assign(HOUR=lambda x: x.index.hour,
DAYOFWEEK=lambda x: x.index.dayofweek)
)
# Create a separate data frame with the mean number of vehicles per hour in a day.
hr_pattern = (
df
.groupby('HOUR', as_index=False)['NUMBER_OF_VEHICLES']
.mean()
.rename(columns={"NUMBER_OF_VEHICLES" : "hr_pattern"})
)
# Create a separate data frame with the mean number of vehicles per day of week.
week_pattern = (
df
.groupby('DAYOFWEEK', as_index=False)['NUMBER_OF_VEHICLES']
.mean()
.rename(columns={"NUMBER_OF_VEHICLES" : "week_pattern"})
)
# Merge with the main dataset
df_merged = (
df.reset_index()
.merge(hr_pattern, on='HOUR', how='inner')
.merge(week_pattern, on='DAYOFWEEK', how='inner')
.set_index('DATE_TIME')
)
# Linear interpolation
df_merged['lin_impute'] = df_merged['NUMBER_OF_VEHICLES'].interpolate("linear", limit_direction="both")
Now, here is the plot after approach 2:
The above plot shows that the missing values have been filled in a way that shows the hourly and weekly patterns in the data.
My question is how did the 2nd approach lead to this result where the linearly interpolated time series captures both the hourly and weekly patterns in the data instead of filling in the missing values linearly?
Please help me to understand. Thank you so much for your help.