
I am currently building an ARIMAX model with the library pmdarima by using:

pmdarima.pipeline.Pipeline.fit(y, exogenous=None, **fit_kwargs)

The documentation describes the parameter as follows:

exogenous : array-like, shape=[n_obs, n_vars], optional (default=None)

An optional 2-d array of exogenous variables. If provided, these variables are used as additional features in the regression operation. This should not include a constant or trend. Note that if an ARIMA is fit on exogenous features, it must be provided exogenous features for making predictions.

But I do not understand what this format means: shape=[n_obs, n_vars]?

What is the meaning of n_obs and n_vars?

And why do we need this format rather than an exogenous variable in time series format?

tuomastik

2 Answers

2

Mr. Taylor Smith sent me the following explanation by email:

Exogenous variables, or covariates, are presented as 2-dimensional matrices to most ML algorithms, as I'm sure you're aware. Along the row axis are observations, and along the column axis are variables or feature vectors (hence n_samples x n_features). The convention you are asking about is one that NumPy and scikit-learn use in denoting the shape of an array-like object (see for instance the documentation on scikit-learn's Lasso). shape=[n_obs, n_vars] simply means a 2-d matrix with samples along the rows and variables along the columns.

As to your question about why you cannot use a time series... your y variable should be a time series (just a vector, or 1-d array, really), as that's what you're going to forecast from. That is the only required piece of data. The exogenous variables are purely optional pieces of supplementary data.
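To make the shape convention concrete, here is a minimal NumPy sketch. The variable names (temperature, promo_flag) and the values are made up for illustration and are not part of pmdarima. Each column of the exogenous matrix is itself a time series aligned row by row with y.

```python
import numpy as np

n_obs = 36

# y: the series to forecast, a 1-d array of shape (n_obs,)
y = np.arange(n_obs, dtype=float)

# Two hypothetical exogenous series observed at the same time steps as y
temperature = np.linspace(10.0, 20.0, n_obs)
promo_flag = (np.arange(n_obs) % 12 == 0).astype(float)

# Stack them column-wise: rows are observations, columns are variables
X = np.column_stack([temperature, promo_flag])
print(y.shape)  # (36,)
print(X.shape)  # (36, 2), i.e. [n_obs, n_vars]

# A single exogenous series is still passed as 2-d, with n_vars == 1
X_single = temperature.reshape(-1, 1)
print(X_single.shape)  # (36, 1)
```

So a lone exogenous time series only needs a reshape to one column before it fits the documented format.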

tuomastik
0

I came across this question while looking for the same thing. Here is how I got it to work with exogenous variables. After fitting, call model.summary() to verify that the exogenous variable appears in the model.

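import pandas as pd
import pmdarima as pm

# result_df is the main DataFrame (assumed to have a default 0..N-1 integer index)
# Trains on the first 24 months, then predicts the next 12 months

i_split = 24
model_input = 'arima'

target = 'target_var'
exogenous = 'exogenous_var'

# .loc slicing is inclusive on both ends: rows 0..23 for training, 24..35 for testing
train = result_df.loc[0:i_split - 1, [target, exogenous]]
test = result_df.loc[i_split:i_split + 11, [target, exogenous]]

# y is 1-d; X is 2-d with shape [n_obs, n_vars], hence the double brackets
y_train, X_train = train[target], train[[exogenous]]
X_test = test[[exogenous]]

# Note: older pmdarima releases name the X argument `exogenous`
if model_input == 'arima_auto':
    # auto_arima returns an already fitted model
    model = pm.auto_arima(y_train, X=X_train, seasonal=False, m=12, stepwise=True, trace=True, start_p=0, start_q=0, start_P=0, start_Q=0, max_p=2, max_q=2, maxiter=50000, with_intercept=True, trend='ct')
elif model_input == 'arima':
    # seasonal, m, stepwise and trace are auto_arima options, so they are not passed to ARIMA
    model = pm.arima.ARIMA(order=(1, 0, 1), maxiter=6000, with_intercept=True, trend='ct')
    # Train on y_train with the exogenous variable as X
    model.fit(y_train, X=X_train)

# Predict 12 periods ahead; X must supply one row per forecast period
preds = model.predict(n_periods=12, X=X_test)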
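Before calling fit and predict, it can help to confirm that the pieces have the shapes the exogenous argument expects. The sketch below uses only pandas and NumPy. The result_df here is a hypothetical stand-in filled with random numbers, assuming 36 monthly rows and a default integer index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for result_df: 36 monthly rows, default integer index
rng = np.random.default_rng(0)
result_df = pd.DataFrame({
    "target_var": rng.normal(size=36),
    "exogenous_var": rng.normal(size=36),
})

i_split, n_periods = 24, 12

# .loc is inclusive on both ends, so 0:23 is 24 rows and 24:35 is 12 rows
train = result_df.loc[0:i_split - 1]
test = result_df.loc[i_split:i_split + n_periods - 1]

y_train = train["target_var"]        # 1-d Series, shape (24,)
X_train = train[["exogenous_var"]]   # 2-d DataFrame, shape (24, 1)
X_test = test[["exogenous_var"]]     # 2-d DataFrame, shape (12, 1)

# Sanity checks: y is 1-d, X is 2-d, and row counts line up
assert y_train.ndim == 1 and X_train.ndim == 2
assert len(X_train) == len(y_train)
assert len(X_test) == n_periods
```

Selecting with double brackets keeps the result a DataFrame, which is why no explicit pd.DataFrame(...) wrapper is needed.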