Auto ARIMA in Python results in poor fitting prediction of trend

Question

I'm new to ARIMA and am attempting to model a dataset in Python using auto ARIMA. I'm using auto ARIMA because I believe it will be better at choosing the values of p, d and q; however, the results are poor and I need some guidance. Please see my reproducible attempt below.

Attempt as follows:

    # DEPENDENCIES
    import pandas as pd 
    import numpy as np 
    import matplotlib.pyplot as plt
    import pmdarima as pm 
    from pmdarima.model_selection import train_test_split 
    from statsmodels.tsa.stattools import adfuller
    from pmdarima.arima import ADFTest
    from pmdarima import auto_arima
    from sklearn.metrics import r2_score 

    # CREATE DATA
    data_plot = pd.DataFrame(data removed)

    # SET INDEX
    data_plot['date_index'] = pd.to_datetime(data_plot['date'])
    data_plot.set_index('date_index', inplace=True)

    # CREATE ARIMA DATASET
    arima_data = data_plot[['value']]
    arima_data

    # PLOT DATA
    arima_data['value'].plot(figsize=(7,4))

The above steps result in a dataset that should look like this.

    # Dickey-Fuller test for stationarity
    adf_test = ADFTest(alpha=0.05)
    adf_test.should_diff(arima_data)

Result = 0.9867, indicating non-stationary data, which should be handled by an appropriate order of differencing later in the auto ARIMA process.

    # Assign training and test subsets - 80:20 split

    print('Dataset dimensions:', arima_data.shape)
    train_data = arima_data[:-24]
    test_data = arima_data[-24:]
    print('Training data dimension:', train_data.shape, round((len(train_data)/len(arima_data)*100),2),'% of dataset')
    print('Test data dimension:', test_data.shape, round((len(test_data)/len(arima_data)*100),2),'% of dataset')

    # Plot training & test data
    plt.plot(train_data)
    plt.plot(test_data)

    # Run auto arima
    arima_model = auto_arima(train_data, start_p=0, d=1, start_q=0,
                             max_p=5, max_d=5, max_q=5,
                             start_P=0, D=1, start_Q=0, max_P=5, max_D=5,
                             max_Q=5, m=12, seasonal=True,
                             stationary=False,
                             error_action='warn', trace=True,
                             suppress_warnings=True, stepwise=True,
                             random_state=20, n_fits=50)

    print(arima_model.aic())

Output suggests best model is 'ARIMA(1,1,1)(0,1,0)[12]' with AIC 1725.35484

    # Store predicted values and view resultant df
    # (forecast exactly as many periods as there are test rows)

    prediction = pd.DataFrame(arima_model.predict(n_periods=len(test_data)), index=test_data.index)
    prediction.columns = ['predicted_value']
    prediction

    # Plot prediction against test and training trends

    plt.figure(figsize=(7,4))
    plt.plot(train_data, label="Training")
    plt.plot(test_data, label="Test")
    plt.plot(prediction, label="Predicted")
    plt.legend(loc='upper right')
    plt.show()

    # Finding r2 model score
    test_data['predicted_value'] = prediction
    r2_score(test_data['value'], test_data['predicted_value'])

Result: -6.985

score 2 · Answer 1 · answered Jun 01 '23 at 12:59

ARIMA has assumptions that need to be checked before applying it to the data. One of them is that the data need to be stationary, i.e. they should not have trend or seasonality. You can check for trend by plotting; your graph clearly shows an upward trend.

You can also check seasonality from the graph, or use the Dickey-Fuller test to test the stationarity hypothesis.

    import statsmodels.tsa.stattools as ts
    ts.adfuller(data.col)

See this answer, which explains well how to perform and read the ADF test: "How to interpret adfuller test results?"

Always check the ACF and PACF plots: lags whose bars lie beyond the confidence limits indicate autocorrelation. Use them to check whether stationarity holds.

As explained by Jose, differencing can be used to make the data stationary.

SARIMA models consider the non-seasonal order (p, d, q) and the seasonal components (S, P', D', Q'); the SARIMAX variant also handles exogenous variables.

Jose · Answer 2 · 2023-06-01T14:30:03.273

Is auto_arima a method you wrote yourself? It depends on how you difference the data and what you do there. Did you check the autocorrelation and partial autocorrelation to find out which repeating time lags you have?

Also, it seems you have a seasonal pattern every year; you could try a SARIMA model if you are not doing so already.

To try a SARIMA model you have to:

Stationarize the data. In this case, differencing turns the moving mean into a stationary one.

    data_stationarized = train_data.diff()[1:]
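To see what differencing does, here is a small sketch on a toy monthly series (a linear trend plus a fixed 12-month pattern, invented for illustration). The names `s`, `first_diff` and `seasonal_diff` are just for this example.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=36, freq="MS")
# Hypothetical series: linear trend plus a pattern that repeats every 12 months
values = 2.0 * np.arange(36) + np.tile(np.arange(12.0), 3)
s = pd.Series(values, index=idx)

# First difference: y[t] - y[t-1], removes the linear trend (d=1)
first_diff = s.diff().dropna()

# Seasonal difference: y[t] - y[t-12], removes the 12-month pattern (D=1, m=12)
seasonal_diff = s.diff(12).dropna()

print(first_diff.head())
print(seasonal_diff.head())
```

On this toy series the seasonal difference is constant, because both the repeating pattern and the changing level cancel out when each month is compared with the same month a year earlier.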

Check the autocorrelation and partial autocorrelation to check the seasonality. You can use the library statsmodels for this.

    import statsmodels.api as sm
    sm.graphics.tsa.plot_acf(data_stationarized);

You can see that the most prominent lag is the twelfth one. As the granularity of the data is monthly, that means there is a prominent seasonal pattern every 12 months.

We can check the partial autocorrelation to confirm it too:

    sm.graphics.tsa.plot_pacf(data_stationarized);

Again the most prominent lag is the twelfth one.

Fit the model with a seasonal period of 12. There are more parameters that can be adjusted for better results, but explaining them all would make this post very long.

    model = sm.tsa.SARIMAX(endog=train_data, order=(2,0,0), seasonal_order=(2,0,0,12))
    model_fit = model.fit()

Evaluate the results

    from sklearn.metrics import mean_squared_error

    y_pred = model_fit.forecast(steps=24)

    # RMSE is the square root of the MSE
    mean_squared_error(y_true=test_data.values, y_pred=y_pred) ** 0.5

This outputs 12063.88, which you can use to compare different results more rigorously.
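For intuition about these metrics, here is a minimal sketch with made-up numbers that computes RMSE and R² by hand. It also shows why R² can be negative, as in the question (-6.985): that happens whenever the forecast has a larger squared error than simply predicting the mean of the test values.

```python
import numpy as np

# Made-up values for illustration only, not the question's data
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([20.0, 20.0, 20.0, 20.0])   # a flat, biased forecast

# RMSE: typical error size, in the same units as the series
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

ss_res = np.sum((y_true - y_pred) ** 2)            # squared error of the forecast
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # squared error of predicting the mean
r2 = 1 - ss_res / ss_tot                           # negative when ss_res > ss_tot

print(rmse, r2)
```

Because RMSE depends on the scale of the data, it is only meaningful for comparing models on the same test set.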

For a graphical check:

    # forecast as many steps as there are test rows (24)
    prediction = pd.DataFrame(model_fit.forecast(steps=24), index=test_data.index)
    prediction.columns = ['predicted_value']
    prediction

    # Plot prediction against test and training trends

    plt.figure(figsize=(7,4))
    plt.plot(train_data, label="Training")
    plt.plot(test_data, label="Test")
    plt.plot(prediction, label="Predicted")
    plt.legend(loc='upper right')
    plt.xticks([])
    plt.yticks([])
    plt.show();

Now you can see that the predictions get closer to the expected values.

You could continue fine-tuning the order and seasonal order to get even better results; I would advise checking the statsmodels docs.

Another piece of advice is to analyze the autocorrelation and partial autocorrelation of the residuals to check whether your model is capturing all of the patterns. You have them in the model_fit object.

No, auto_arima is a Python library but for some reason I left it out of the import/dependencies list. I will try SARIMA, thank you. — jimiclapton, May 25 '23 at 19:35
@jimiclapton I have edited the answer to help you better. If you think I helped you, you could give me an upvote :) — Jose, May 29 '23 at 16:50
Thanks for your help, Jose. Happy to accept the answer. Would appreciate it if you could also remove the values from the y-axis on the chart in your answer. — jimiclapton, Jun 01 '23 at 13:57
Jose, you mentioned the seasonal order of 12 but you didn't mention how you arrived at the values '2,0,0' for p, d & q. Can you elaborate, please? — jimiclapton, Jun 05 '23 at 14:43
@jimiclapton check this https://analyticsindiamag.com/quick-way-to-find-p-d-and-q-values-for-arima/ , where it's explained in much more detail. — Jose, Jun 06 '23 at 09:40
