1

I have a time series dataset that contains 29,184 hours of data. There are around 1k+ rows of missing values.

Here is the link to the dataset: dropbox

Here is a small preview of the dataset:

                     NUMBER_OF_VEHICLES
DATE_TIME                              
2020-01-01 00:00:00                39.0
2020-01-01 01:00:00                 3.0
2020-01-01 02:00:00                 1.0
2020-01-01 03:00:00                 1.0
2020-01-01 04:00:00                 2.0

Here is a section of a plot that shows missing values: enter image description here

The column NUMBER_OF_VEHICLES contains both hourly and weekly seasonality. I am trying to meaningfully fill in the missing values in a way that captures the hourly and weekly seasonality too. Here are 2 approaches I did:

Approach 1

# Saving index of missing values
missing_ix = df[df['NUMBER_OF_VEHICLES'].isnull()].index

# Create HOUR and DAYOFWEEK features from datetime index
df= (
    df.assign(HOUR=lambda x: x.index.hour,
              DAYOFWEEK=lambda x: x.index.dayofweek)
)

# Linear interpolation
df['lin_impute'] = df['NUMBER_OF_VEHICLES'].interpolate("linear", limit_direction="both")

Here is the plot of the plot after linear interpolation. The indexes with missing values are shown in red color. enter image description here

The above approach does not capture the seasonality patterns. The missing data is filled in linearly.

Approach 2

# Saving index of missing values
missing_ix = df[df['NUMBER_OF_VEHICLES'].isnull()].index

# Create HOUR and DAYOFWEEK features from datetime index
df= (
    df.assign(HOUR=lambda x: x.index.hour,
              DAYOFWEEK=lambda x: x.index.dayofweek)
)

# Create a separate data frame with the mean number of vehicles per hour in a day.
hr_pattern = (
    df
    .groupby('HOUR', as_index=False)['NUMBER_OF_VEHICLES']
    .mean()
    .rename(columns={"NUMBER_OF_VEHICLES" : "hr_pattern"})
)

# Create a separate data frame with the mean number of vehicles per day of week.
week_pattern = (
    df
    .groupby('DAYOFWEEK', as_index=False)['NUMBER_OF_VEHICLES']
    .mean()
    .rename(columns={"NUMBER_OF_VEHICLES" : "week_pattern"})
)

# Merge with the main dataset
df_merged = (
    df.reset_index()
    .merge(hr_pattern, on='HOUR', how='inner')
    .merge(week_pattern, on='DAYOFWEEK', how='inner')
    .set_index('DATE_TIME')
)

# Linear interpolation
df_merged['lin_impute'] = df_merged['NUMBER_OF_VEHICLES'].interpolate("linear", limit_direction="both")

Now, here is the plot after approach 2: enter image description here

The above plot shows that the missing values have been filled in a way that shows the hourly and weekly patterns in the data.

My question is how did the 2nd approach lead to this result where the linearly interpolated time series captures both the hourly and weekly patterns in the data instead of filling in the missing values linearly?

Please help me to understand. Thank you so much for your help.

Joe
  • 91
  • 6
  • 1
    Linear interpolation creates a line, thus the name. This should not be surprising. – Mark Ransom Jun 12 '23 at 12:02
  • hi @MarkRansom, I am trying to understand what caused the difference in results between the 1st approach and 2nd approach. – Joe Jun 12 '23 at 12:15

1 Answers1

1

The short answer is no.

What happens in the second case is that it creates two extra datasets with the mean values per hour and day of the week and joining them with the main dataset, so it fills some of the missing values with the mean of the whole dataset and then interpolating the rest of missing values.

In other words, it seems it captures the seasonality because it populates some of the missing values by the mean per hour and day of the week, and then the missing values left are interpolated better with the linear interpolation because it works well when the adjacent data values to the missing value are not missing. That's why in the first case it's a straight line from the left adjacent value to the right one.

Jose
  • 632
  • 1
  • 13
  • 2
    thank you for the insight. In your opinion, is this approach "correct" in filling in missing values where the data contains seasonal patterns? – Joe Jun 12 '23 at 15:03
  • 3
    No problem. It all depends on the application and complexity of the solution. The second approach is very straightforward and fast to implement, but maybe not so accurate in terms of capturing the seasonality. Another more complex and accurate solution could be to use a Seasonal AutoRegressive model to capture better the seasonality, but it would involve more steps. To give you an idea you could check a recent answer I gave in stack [here](https://stackoverflow.com/a/76333710/16984181) (also read the comments, since there a few things explained there) – Jose Jun 12 '23 at 15:37