I am trying to do forecasting with ARIMA. Currently I choose the best ARIMA model and predict a certain period with it, selecting by AIC value and keeping in mind that the lower the AIC, the better. However, I need a way to verify the model my function chooses, so that I do not rely solely on the lowest AIC value. In other words, there should be another way to confirm that the chosen model is giving me the best results.

To give a clear picture: say my ARIMA model is supposed to give me values between 5 and 10 based on the historical input data, but for some reason the "best" model gives me values somewhere around 1000. That is definitely unusual.

What could be an alternative way to verify that the ARIMA model is giving me correct values, apart from the lowest-AIC approach?

Following is my code:

import os
import warnings
from itertools import product

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()
import statsmodels.tsa.api as smt
import statsmodels.api as sm

def arima_ci(df_train):

  df_s = df_train

  param_range = 3
  ps = range(0, param_range)
  d = 1
  qs = range(0, param_range)

  # Create a list with all possible combinations of parameters
  parameters = product(ps, qs)
  parameters_list = list(parameters)

  # Train many ARIMA models to find the best set of parameters
  def optimize_ARIMA(parameters_list, d):
      """
          parameters_list - list with (p, q) tuples
          d - integration order
      """

      results = []
      best_aic = float('inf')

      for param in parameters_list:
          try:
              model = sm.tsa.SARIMAX(df_s, order=(param[0], d, param[1])).fit(disp=-1)
          except Exception:
              # Skip parameter combinations that fail to fit
              continue

          aic = model.aic

          # Save best model, AIC and parameters
          if aic < best_aic:
              best_model = model
              best_aic = aic
              best_param = param
          results.append([param, model.aic])

      result_table = pd.DataFrame(results)
      result_table.columns = ['parameters', 'aic']
      # Sort by AIC in ascending order (lower AIC is better)
      result_table = result_table.sort_values(by='aic', ascending=True).reset_index(drop=True)

      return result_table
  with warnings.catch_warnings():
    warnings.filterwarnings("ignore")  # Ignore all warnings within this block
    result_table = optimize_ARIMA(parameters_list, d)

  p, q = result_table.parameters[0]

  best_model = sm.tsa.SARIMAX(df_s, order=(p, d, q)).fit(disp=-1)
  # print(best_model.summary())

  # Forecast one period ahead
  n_steps = 1

  forecast = best_model.get_forecast(steps=n_steps)

  forecast_values = forecast.predicted_mean
  forecast_ci = forecast.conf_int(alpha=0.05)
  lower_ci = forecast_ci.iloc[:, 0]
  upper_ci = forecast_ci.iloc[:, 1]

  last_date = df_train.index[-1]                    # Last date in the training data
  next_month = last_date + pd.DateOffset(months=1)  # First month to forecast

  # Assemble the forecast and its confidence interval into a DataFrame
  arima_forecast_df = pd.DataFrame({
      # 'Invoice Date': pd.date_range(start=next_month, periods=n_steps, freq='MS'),
      'arima': forecast_values.astype(int),
      'arima_l': lower_ci.astype(int).apply(lambda x: max(0, x)),  # Lower confidence bound, floored at 0
      'arima_u': upper_ci.astype(int)                              # Upper confidence bound
  })

  return arima_forecast_df

result_arima_ci = arima_ci(dfn_resampled) 
print(type(result_arima_ci))
result_arima_ci

Here, dfn_resampled is a pandas Series; in simple words, it is my training data:

# code
dfn_resampled.info() 
# output
<class 'pandas.core.series.Series'>
DatetimeIndex: 73 entries, 2017-07-01 to 2023-07-01
Freq: MS
Series name: Quantity
Non-Null Count  Dtype
--------------  -----
73 non-null     int64
dtypes: int64(1)

I am avoiding the auto-arima library, as it gave me poor results. Please help me with this.

raiyan22
2 Answers


You can benchmark every candidate model by computing its mean squared error (MSE) over your most recent data. Depending on the granularity of your data and your forecast horizon, that holdout could be the last day, the last week, the last month, etc. Then just choose the model that returns the lowest MSE. I've had good results with this technique in the past; a sketch follows.
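For example, here is a minimal sketch of that idea (the helper name select_arima_by_mse, the six-month holdout, and the parameter ranges are assumptions of mine, not from your code): fit each (p, q) candidate on everything except the most recent observations, forecast the holdout, and keep the order with the lowest MSE.

import warnings
from itertools import product

import numpy as np
import statsmodels.api as sm

def select_arima_by_mse(series, d=1, param_range=3, holdout=6):
    """Pick the (p, d, q) order whose forecasts best match a recent holdout."""
    train, test = series.iloc[:-holdout], series.iloc[-holdout:]
    best_mse, best_order = float('inf'), None

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")
        for p, q in product(range(param_range), repeat=2):
            try:
                model = sm.tsa.SARIMAX(train, order=(p, d, q)).fit(disp=-1)
            except Exception:
                continue  # skip orders that fail to fit
            preds = model.forecast(steps=holdout)
            mse = np.mean((np.asarray(test) - np.asarray(preds)) ** 2)
            if mse < best_mse:
                best_mse, best_order = mse, (p, d, q)

    return best_order, best_mse

# Refit the winning order on the full series before forecasting:
# order, mse = select_arima_by_mse(dfn_resampled)
# final_model = sm.tsa.SARIMAX(dfn_resampled, order=order).fit(disp=-1)

An out-of-sample check like this would also catch the failure mode you describe: a model whose holdout forecasts land around 1000 while the series sits between 5 and 10 will score a huge MSE and never be selected.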

Eduardo
  • Thanks, my training data has one data point per month from 2017-07-01 to 2023-07-01 after resampling, and I am forecasting 2023-08-01, the month right after the last date in my training data. Do you have example code for what you are suggesting? That would really help me decide how to proceed. – raiyan22 Aug 23 '23 at 14:01

You have several options to evaluate the model. If you want to use sklearn, you can pick from these:

from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score

You can find the docs here.

The approach is always to compare held-out test data (which is not part of your training data) against the corresponding forecast values.

Which metric you want to use depends on your desired output.
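As a rough sketch of that evaluation on your dfn_resampled series, assuming a six-month holdout and a placeholder (1, 1, 1) order (neither of which comes from the question):

from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)
import statsmodels.api as sm

holdout = 6  # hold out the last 6 months as test data (assumed split)
train, test = dfn_resampled.iloc[:-holdout], dfn_resampled.iloc[-holdout:]

model = sm.tsa.SARIMAX(train, order=(1, 1, 1)).fit(disp=-1)  # placeholder order
preds = model.forecast(steps=holdout)

print('MSE :', mean_squared_error(test, preds))
print('MAE :', mean_absolute_error(test, preds))
print('MAPE:', mean_absolute_percentage_error(test, preds))
print('R2  :', r2_score(test, preds))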

PV8