
I am trying to forecast a time series in Python using auto_arima with Fourier terms as exogenous features. The data come from Kaggle's Store Item Demand Forecasting Challenge. They consist of a long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. A particularity of this dataset is that it has daily data with both weekly and annual seasonalities.

In order to capture these two levels of seasonality, I first used TBATS, as recommended by Rob J Hyndman in Forecasting with daily data, which actually worked pretty well.

I also followed this Medium article posted by the creator of the TBATS Python library, who compared it with SARIMAX + Fourier terms (also recommended by Hyndman).

But now, when I try the second approach with pmdarima's auto_arima and Fourier terms as exogenous features, I get unexpected results.

In the following code, I only used the train.csv file, which I split into train and test data (the last year is used for forecasting), and set the maximum order of the Fourier terms to K = 2.

My problem is that I obtain a smoothed forecast (see image below) that does not seem to capture the weekly seasonality, which differs from the result at the end of this article. Is there something wrong with my code?
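For reference, the Fourier terms used here are just sine/cosine pairs of harmonics k = 1..K of the seasonal period. A minimal NumPy/pandas sketch of that calculation (column names are my own; pmdarima's FourierFeaturizer names its columns differently but generates equivalent terms):

```python
import numpy as np
import pandas as pd

def fourier_terms(index, period, K):
    """Sin/cos pairs sin(2*pi*k*t/period), cos(2*pi*k*t/period) for k = 1..K."""
    t = np.arange(1, len(index) + 1)
    cols = {}
    for k in range(1, K + 1):
        cols[f'sin_{k}'] = np.sin(2 * np.pi * k * t / period)
        cols[f'cos_{k}'] = np.cos(2 * np.pi * k * t / period)
    return pd.DataFrame(cols, index=index)

# 5 years of daily dates, annual period, K = 2 -> 4 exogenous columns
idx = pd.date_range('2013-01-01', periods=1826, freq='D')
exog = fourier_terms(idx, period=365.25, K=2)
print(exog.shape)  # (1826, 4)
```

With K = 2 and m = 365.25 this gives four slowly varying columns that let the regression part of SARIMAX absorb the annual cycle.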

Complete code:

# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt

# Load the data, which consist of multiple long-format time series stacked on top of each other
# There are 10 (stores) * 50 (items) = 500 time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)

# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales

# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 2)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])


# Split the time series as well as exogenous features data into train and test splits 
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test =  y_prime.iloc[(len(y_prime)-365):] # last year for testing

exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]


# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, exogenous=exog_to_train, seasonal=True, m=7)

# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))


# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()

Actual data and forecasts over the last year of data

Thanks in advance for your answers!

Downforu
  • I have not tried to run your code, but I saw from the [pmdarima docs](https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.AutoARIMA.html) that m should be the number of periods in each season. You have put this to be 7, which would be wrong since you have 1 year of data. I assume that this number should be 52. – Oddaspa Aug 25 '21 at 13:25
  • Why would it be 52 since we have daily records? From the doc I understand that m is the number of records in my dataset inside each season. In the data we have a weekly seasonality and an annual seasonality. Therefore I have two possibilities: m = 7 or m = 365. But I do not see why it should be 52. Could you please elaborate on that? And because ARIMA does not deal with long seasonalities, I used it for the weekly seasonality. – Downforu Aug 25 '21 at 13:36
  • I assumed so because there are 52 weeks in a year. From what I understand 365 would mean daily seasonality, and 1 would mean annual seasonality. This was just a guess, I am running your code now to inspect. – Oddaspa Aug 25 '21 at 13:44
  • After running for 25 min, Colab ran out of RAM. Sorry that I could not be of any help. Good Luck! – Oddaspa Aug 25 '21 at 14:04
  • No problem, thank you for your help ! – Downforu Aug 25 '21 at 14:07
  • 1
    There is nothing wrong with your code, but for some reason `auto_arima` finds that weekly seasonal differencing is not optimal for your data (i.e. it returns `D=0` where `D` is the order of the seasonal differencing). You can set `D=1` in the `auto_arima` call directly, or otherwise leave `D=None` and change the other `auto_arima` optimization parameters (such as the information criterion, the number of iterations, etc.) to see if it eventually returns `D=1`, in which case your forecasts will look as expected. – Flavia Giammarino Aug 25 '21 at 14:24
  • Thank you very much! It works now. I just need to dig deeper into these parameters to fully understand them. – Downforu Aug 25 '21 at 14:38
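As a side note on the fix discussed above: setting `D=1` with `m=7` tells the model to work on the lag-7 seasonally differenced series y_t − y_{t−7}, which is what lets it track the weekly pattern. A minimal pandas sketch of what that differencing does, on synthetic data (illustrative only, not the Kaggle series):

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a strong weekly pattern plus noise
rng = np.random.default_rng(0)
idx = pd.date_range('2016-01-01', periods=140, freq='D')
weekly = np.tile([10, 12, 15, 14, 13, 20, 22], 20)  # repeats every 7 days
y = pd.Series(weekly + rng.normal(0, 0.5, 140), index=idx)

# D=1 with m=7 corresponds to one round of lag-7 differencing
y_sdiff = y.diff(7).dropna()

# The weekly pattern cancels out: the differenced series is far less variable
print(y.std(), y_sdiff.std())
```

When `auto_arima` returns `D=0`, it has decided (via its internal seasonal unit-root test) that this differencing step is unnecessary, which is why forcing `D=1` changes the forecasts so visibly.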

1 Answer


Here's the answer in case someone's interested. Thanks again, Flavia Giammarino.

# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt

# Load the data, which consist of multiple long-format time series stacked on top of each other
# There are 10 (stores) * 50 (items) = 500 time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)

# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales

# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 1)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])


# Split the time series as well as exogenous features data into train and test splits 
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test =  y_prime.iloc[(len(y_prime)-365):] # last year for testing

exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]


# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, D=1, exogenous=exog_to_train, seasonal=True, m=7)

# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))


# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
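To check quantitatively that the `D=1` fit improves on the earlier attempt (or on TBATS), one could score the hold-out year with standard error metrics. A minimal sketch with toy arrays as placeholders; with the variables above you would pass `y_to_test` and the forecast column instead:

```python
import numpy as np
import pandas as pd

def mae(actual, forecast):
    """Mean absolute error between two aligned sequences."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(forecast)))

def rmse(actual, forecast):
    """Root mean squared error between two aligned sequences."""
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2))

# Toy placeholder values; e.g. mae(y_to_test, y_arima_exog_forecast.iloc[:, 0])
actual = pd.Series([20, 22, 25])
forecast = pd.Series([19, 23, 24])
print(mae(actual, forecast), rmse(actual, forecast))  # 1.0 1.0
```

Comparing MAE/RMSE across the `D=0` and `D=1` fits makes the improvement visible without relying on the plot alone.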

Actual data and forecast over the last year after setting D=1

Downforu
  • According to https://robjhyndman.com/hyndsight/longseasonality/, when using fourier terms as exogenous variables, the ARIMA model should not be seasonal. – bluesmonk Sep 14 '21 at 17:22
  • 1
    What Pr. Hyndman says is that long seasonalities are not meant to be captured by the ARIMA model. That's why the seaonal parameter is set to False and that he uses Fourier terms. But in my case, it concerns a two seasonal periods among which there is a smaller one (weekly, m=7). In the [link](https://robjhyndman.com/hyndsight/dailydata/) I posted, seasonal=FALSE but when you load more comments below and look for "Nassim" (the poster of a related question), Pr. Hyndman confirms that you can handle smaller seasonalities with seaonal=TRUE inside the auto.arima function. Hope this helps. – Downforu Sep 18 '21 at 10:02
  • I ran this code with the same data but the forecast doesn't follow the long-term seasonality. – Hasan Zafari Jan 09 '23 at 02:27