I want to forecast a product's `sales_index` using multiple features from a monthly time series. I started with ARMA and ARIMA, but the output wasn't satisfying: I only used the dates and sales columns for forecasting, and the output looks unrealistic to me. I think I should include more feature columns to predict the `sales_index` column. Is there any way to do this prediction using multiple features from the monthly time series? I haven't done much time series work with scikit-learn. Can anyone point me to a possible way of doing this? Any thoughts?

My attempt using ARMA/ARIMA:

Here is reproducible monthly time series data on this gist, and here is my current attempt:

import pandas as pd
from statsmodels.tsa.arima_model import ARMA
import matplotlib.pyplot as plt

# load the monthly data and attach a DatetimeIndex
df = pd.read_csv("tsdf.csv", sep=",")
dates = pd.date_range(start='2015-01', freq='MS', periods=len(df))
df.set_index(dates, inplace=True)

# train on everything before 2019, test on the rest
train = df[df.index < '2019-01']
test = df[df.index >= '2019-01']

# fit a univariate ARMA(2, 0) on sales_index and forecast the test period
model = ARMA(train['sales_index'], order=(2,0))
model_fit = model.fit()
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
# plot results
plt.figure(figsize=(12,6))
plt.plot(test['sales_index'])
plt.plot(predictions, color='red')
plt.show()

and here is the output of my current attempt:

[plot: ARMA forecast (red) vs. actual test values]

In my attempt, I simply used `df['sales_index']` and the dates for the ARMA model. Clearly, done this way, the prediction output is not very realistic or informative. I am wondering if there is any way I can feed all feature columns except `df['sales_index']` to the model to predict `df['sales_index']`. I couldn't figure out a better way of doing this with the ARMA model.

Perhaps scikit-learn would serve better for this prediction, but I am not sure how to use sklearn for this kind of time series analysis. Can anyone point me to a possible sklearn solution for this time series? Any thoughts? Thanks

Hamilton
  • This looks interesting. Any chance you could say what the column names stand for? Most of them are pretty opaque. Also, if you're importing from your gist, it's tab-separated, not comma-separated. – hrokr Aug 25 '20 at 17:11
  • I'm playing with it now. What I've seen so far is that there is some seasonality (from a plotly express `fig = px.line(df, x="dates", y="sales_index")`), but your timeframe is pretty short relative to the variation. Hence, asking about the other columns. It's like if oil goes up, jet fuel goes up, airline profits go down. – hrokr Aug 25 '20 at 17:43
  • Well, you develop a model to make predictions. Being OK with a bad one now as you iterate through it is fine, but you're going somewhere with it. So you want the seasonality and to be able to show its effects. Knowing *how* the data is connected is important. I can show a great predictive model relating the number of Irish and Italian immigrants from '42-43 to the number of meteors in 1950. But it's just complete BS. That's why I was asking what the other columns are. Some may be collinear. – hrokr Aug 25 '20 at 18:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/220449/discussion-between-hamilton-and-hrokr). – Hamilton Aug 25 '20 at 18:09
  • @Hamilton would the answer more likely consider a univariate or a multivariate approach for forecasting? – Miguel Trejo Aug 28 '20 at 19:27
  • @MiguelTrejo consider it a multivariate forecasting approach, because I want to see the additive power of different features on the model for forecasting `sales_index`. I would like to see both univariate and multivariate approaches to forecasting. Any thoughts? – Hamilton Aug 28 '20 at 20:14
  • @Hamilton, yes indeed one can use trees for this. But one should first remove the trend from the data. One thing that gets my attention on your graph is that there appears to be a structural change between 2020-01 and 2020-03. For this analysis, using some R libraries within Python would be suitable. Would you consider an approach like this, or do you want to stick to Python libraries? – Miguel Trejo Aug 28 '20 at 21:11
  • @MiguelTrejo I think it would be interesting if we can use R libraries within Python; I know how to call R functions in Python. I'd like to see your elaboration in an answer if possible. Thank you! – Hamilton Aug 29 '20 at 02:16
  • @MiguelTrejo could you show the approach using the [reproducible data on this gist](https://gist.github.com/jerry-shad/425723d8ea0dbdb6fa7a72f999365996)? Do we need to remove any seasonality from the data? Do we need to validate whether the time series is stationary or not? Any updates? Thanks a lot :) – Hamilton Aug 30 '20 at 15:30
  • @Hamilton, yes, the stationarity part will tell you how many times you'll need to difference your time series, and likewise for the seasonal part, how many seasonal differences we should apply to the data. Let me provide you a quick update on that. – Miguel Trejo Aug 30 '20 at 18:50
  • @MiguelTrejo thanks for your detailed answer. In your approach, you only used `dates` and `sales_index` to make the prediction. Can we add other columns as potential features to reinforce forecasting `sales_index`? How do we make this happen? Also, why do we only include log-difference values for forecasting `sales_index`? Can we include other columns as features when forecasting `sales_index`? Could you share your extended thoughts, with a time series plot of the forecast, in a gist or Colab? I am going to accept your canonical answer shortly. Thanks a lot :) – Hamilton Aug 31 '20 at 01:37
  • @MiguelTrejo why do we use `np.exp(tsdf.tail(test_size).log_sales_index_lag_1 + pred)` when we make the plot? The `x-axis` is supposed to be the date; how do we do that? Any idea you can point me to? Thanks! – Hamilton Oct 08 '20 at 17:30

3 Answers


Overview

Using the scikit-learn library, one can consider different decision trees to forecast data. In this example we'll use an AdaBoostRegressor, but alternatively one can switch to RandomForestRegressor or any other available tree. When choosing trees, one should be aware of the need to remove the trend from the data; here we illustrate controlling the mean and variance of the time series by differencing and by applying a logarithm transform, respectively.

Preparing the data

A time series has two basic components: its mean and its variance. Ideally, we would like to control these components. For the variability, we can simply apply a logarithm transformation to the data, and for the trend we can difference it; we will see this later.

Additionally, for this approach we're considering that the actual value `y_t` can be explained by the two prior values `y_{t-1}` and `y_{t-2}`. You can play around with these lag values by modifying the input to the `range` function.

import numpy as np
import pandas as pd

# Load data
tsdf = pd.read_csv('tsdf.csv', sep="\t")

# For simplicity I just take the target variable and the date
tsdf = tsdf[['sales_index', 'dates']]

# Log value
tsdf['log_sales_index'] = np.log(tsdf.sales_index)

# Add previous n-values
for i in range(3):
    tsdf[f'sales_index_lag_{i+1}'] = tsdf.sales_index.shift(i+1)

    # For simplicity we drop the null values
    tsdf.dropna(inplace=True)

    tsdf[f'log_sales_index_lag_{i+1}'] = np.log(tsdf[f'sales_index_lag_{i+1}'])

    tsdf[f'log_difference_{i+1}'] = tsdf.log_sales_index - tsdf[f'log_sales_index_lag_{i+1}']

Once our data is ready, we get a result similar to the one below.

[table: sales_index with its log value, lag columns, and log differences]

Is the data stationary?

To control the mean component of the time series, we may need to perform some differencing on the data. To determine whether this step is required, we can perform a unit root test. There are several tests for this that make different assumptions; a list of some unit root tests can be found here. For simplicity, we're going to consider the KPSS test, whose null hypothesis is that the data is stationary; in particular, it assumes stationarity around a mean or a linear trend.

from arch.unitroot import KPSS

# Test for stationarity (the KPSS null hypothesis is that the series is stationary)
kpss_test = KPSS(tsdf.sales_index)

# Test summary
print(kpss_test.summary().as_text())

[output: KPSS test summary with p-value = 0.280]

We see that the p-value = 0.280 is greater than the usual 0.05 convention. Therefore, we would apply a first difference to the data. As a side note, one can iteratively perform this test to find out how many differences should be applied to the data.
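As a rough sketch of that iterative idea (not from the original answer; note the direction of the test: the KPSS null is stationarity, so it is a small p-value, i.e. a rejection, that argues for another difference):

# Difference repeatedly until KPSS stops rejecting stationarity (sketch only)
series = tsdf.sales_index.astype(float)
n_diffs = 0
while KPSS(series.dropna()).pvalue < 0.05 and n_diffs < 3:
    series = series.diff()  # apply one more difference
    n_diffs += 1

print(f'Differences suggested: {n_diffs}')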

In the graph below, we compare the original data with its log first difference. Notice there is a sudden change in the last values of the time series; this appears to be a structural change, but we are not going to dig deep into it. If you want to dig more into this topic, these slides from Bruce Hansen are useful.

plt.figure(figsize=(12, 6))
plt.subplot(1,2,1)
plt.plot(tsdf.sales_index)
plt.title('Original Time Series')
plt.subplot(1,2,2)
plt.plot(tsdf.log_difference_1)
plt.title('Log first difference Time Series')

[plots: 'Original Time Series' (left) vs. 'Log first difference Time Series' (right)]

Decision Tree Model

As we've previously said, we're considering decision tree models, and when using them one should be aware of removing the trend from the time series. For example, if you have an upward trend, trees are not good at predicting a downward trend. In the code example below I've chosen the AdaBoostRegressor, but you're free to choose another tree model. Additionally, notice that log_difference_1 is considered to be explained by log_difference_2 and log_difference_3.

Note. Your dataset has other covariates, such as aus_avg_rain or slg_adt_ctl, so to consider them for prediction you could apply lag values to them as well, as sketched below.
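A hypothetical sketch of that idea (it assumes the covariate columns from the gist were kept in tsdf, rather than dropped in the preparation step above):

# Lag the extra covariates so they can act as predictors alongside the
# log differences (column names as they appear in the gist)
for col in ['aus_avg_rain', 'slg_adt_ctl']:
    for lag in (1, 2):
        tsdf[f'{col}_lag_{lag}'] = tsdf[col].shift(lag)

# drop the rows made incomplete by the shifts
tsdf.dropna(inplace=True)

The lagged columns can then be added to X next to log_difference_2 and log_difference_3.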

from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor

# Forecast difference of log values
X, Y = tsdf[['log_difference_2', 'log_difference_3']], tsdf['log_difference_1']

# Split in train-test without shuffling, so the split respects time order
# (random_state has no effect when shuffle=False)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, shuffle=False)

# Initialize the estimator
mdl_adaboost = AdaBoostRegressor(n_estimators=500, learning_rate=0.05)

# Fit the data
mdl_adaboost.fit(X_train, Y_train)

# Make predictions
pred = mdl_adaboost.predict(X_test)

Evaluating the prediction

test_size = X_test.shape[0]
plt.plot(list(range(test_size)), np.exp(tsdf.tail(test_size).log_sales_index_lag_1  + pred), label='predicted', color='red')
plt.plot(list(range(test_size)), tsdf.tail(test_size).sales_index, label='real', color='blue')
plt.legend(loc='best')
plt.title('Predicted vs Real with log difference values')

[plot: 'Predicted vs Real with log difference values']
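If you'd rather have dates on the x-axis, as asked in the comments, a possible variant (it relies on the test rows being the tail of tsdf, which holds here because the split was done without shuffling):

# use the dates kept during preparation instead of integer positions
test_dates = pd.to_datetime(tsdf.tail(test_size).dates)

plt.plot(test_dates, np.exp(tsdf.tail(test_size).log_sales_index_lag_1 + pred), label='predicted', color='red')
plt.plot(test_dates, tsdf.tail(test_size).sales_index, label='real', color='blue')
plt.legend(loc='best')
plt.title('Predicted vs Real with log difference values')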

It seems that the decision tree model predicts the real values fairly accurately. However, to evaluate the model's performance we should consider an evaluation metric; a good intro to this topic can be found in this article, so feel free to pick whichever is most convenient for your approach. I'm just going to go with the TimeSeriesSplit function from scikit-learn to assess the model's error through the mean absolute error.

from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error

X, Y = np.array(tsdf[['log_difference_2', 'log_difference_3']]), np.array(tsdf['log_difference_1'])

# Initialize a TimeSeriesSplitter
tscv = TimeSeriesSplit(n_splits=5)

# Retrieve the lagged log_sales_index and sales_index to untransform the data
tsdf_log_sales_index = np.array(tsdf.copy().reset_index().log_sales_index_lag_1)
tsdf_sales_index = np.array(tsdf.copy().reset_index().sales_index_lag_1)

# Dict to store metric value at every iteration
metric_iter = {}


for idx, val in enumerate(tscv.split(X)):
    train_i, test_i = val

    X_train, X_test = X[train_i], X[test_i]
    Y_train, Y_test = Y[train_i], Y[test_i]

    # Initialize the estimator
    mdl_adaboost = AdaBoostRegressor(n_estimators=500, learning_rate=0.05)

    # Fit the data
    mdl_adaboost.fit(X_train, Y_train)

    # Make predictions
    pred = mdl_adaboost.predict(X_test)

    # Untransform predictions back to the original sales scale
    pred_untransform = [np.exp(val_test + val_pred) for val_test, val_pred in zip(tsdf_log_sales_index[test_i], pred)]

    # Real value
    real = tsdf_sales_index[test_i]

    # Store metric
    metric_iter[f'iter_{idx + 1}'] = mean_absolute_error(real, pred_untransform)

Now we see that the average MAE is quite low.

print(f'Average MAE error: {np.mean(list(metric_iter.values()))}')
>>> Average MAE error: 17.631090959806535
Miguel Trejo
  • Instead of using `dates` and `sales_index` for forecasting, how do we include other columns as features in the model? When we use `AdaBoostRegressor` in the model performance evaluation, how do we make a nice time series forecasting plot where the x-axis shows dates and the y-axis shows real vs. predicted sales_index values? Could you share your extended thoughts on a gist or Colab to keep your answer tidy? Thank you very much! – Hamilton Aug 31 '20 at 01:49
  • @Hamilton, yes, you include the other variables as well. For example, here we consider X = tsdf[['log_difference_2', 'log_difference_3']], because we're initially assuming that these variables could explain our target 'log_difference_1'. However, you could also directly incorporate the other ones, X = tsdf[['log_difference_2', 'log_difference_3', 'aus_avg_rain', 'slg_adt_ctl', ..., etc]]; that is, you're assuming only the current values affect the target variable. Notice that it would be interesting to see the relationship with the lag values, for example, aus_avg_rain_lag1, slg_adt_ctl_lag1 – Miguel Trejo Aug 31 '20 at 16:03
  • Why is the `x-axis` not showing dates? Why do we plot `np.exp(tsdf.tail(test_size).log_sales_index_lag_1 + pred)` when we make the plot instead of the date? Any idea? – Hamilton Oct 08 '20 at 17:32
  • @Hamilton, for the x-axis to show the date, you could retrieve the index of the test data to plot dates. Regarding the transformation, `exp(t_lag + t)` returns the data to its original scale; remember that we applied a log of differences to stabilize the mean and variance components of the time series. – Miguel Trejo Oct 11 '20 at 01:04
  • Thank you for the excellent details! However, I am confused by your original premise: "we assume as null hypothesis that the data is stationary"; then you show `p=0.28>0.05`, which would mean that _you have not disproven the null hypothesis_. This says that the data is indeed stationary... or at least "stationary enough". I agree that it is probably best to take the 1st log difference, but only because there is no need to arbitrarily require 95% confidence that it is non-stationary. Am I missing something? – Mike Williamson Aug 19 '21 at 12:23
  • BTW, this is a very common trap: your model prediction seems as if it's predicting right but shifted... no, it's in fact trying to predict the last value it saw... not really time-series forecasting. There is nothing to blame here, as the vast majority of future forecasting is airy-fairy and unreal. – Yahya Jan 22 '23 at 19:25
  • How could this model be used to predict future values? – jprebys Feb 07 '23 at 21:08

There are two parts to this:

  1. A working predictive code solution.
  2. Brief discussions of several topics

Solution

I'm going to suggest a slightly different and somewhat more abstracted approach: use Darts, which is built on top of scikit-learn. It includes the 'A-list' libraries you'd expect (e.g., pandas and NumPy), but also a few that you would need fairly deep knowledge of the field to think to include (e.g., Torch, Prophet, and Holidays). Furthermore, it has a number of models already included; you can see a bit more on both here. One important note: the particular library is u8darts, not darts. They're similar -- and have similar dependencies -- but they're not the same.
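If the naming trips you up at install time, the install line (assuming you're using pip) is:

pip install u8darts

while the imports below still come from the darts namespace.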

Using it, you can easily develop a predictive model with pretty good results. For example, the entire code -- including incorporating the renamed column headings (as well as dropping a column that you didn't rename, and a duplicate row) -- is just a few lines.

import pandas as pd
import matplotlib.pyplot as plt

from darts import TimeSeries
from darts.models import ExponentialSmoothing

df = pd.read_csv("../data/tsdf.csv", sep="\t")
df.drop(['Unnamed: 0', 'aus_slg_fmCatl'], axis=1, inplace = True)

df.columns = ['Date', 'adult_cattle(head)','Bulls Bullocks & Steers(head)', 'Cows&Heifers(head)',
              'Calves(head)', 'Beef(tonnes)', 'Veal(tonnes)','Adult cattle(kg/head)', 'Calves(kg/head)',
              'aust-avg_rain','US/AUS_ExchangeRate', 'sales_index', 'retail_sales_index']

df.drop_duplicates(subset=['Date'], inplace=True)

# build a univariate TimeSeries from the Beef(tonnes) column
series = TimeSeries.from_dataframe(df, 'Date', 'Beef(tonnes)')

train, val = series.split_before(pd.Timestamp('2019-01-01'))
model = ExponentialSmoothing()
model.fit(train)
prediction = model.predict(len(val))

series.plot(label='actual')
prediction.plot(label='forecast', lw=2)
plt.legend()

Additionally, their repo has a notebook with additional models and examples.

Result

[plot: Exponential Smoothing forecast vs. actual Beef(tonnes)]

Discussion

Addressing the discussion we've had in the comments, I think a few points need to be made.

Removal of Seasonality

This is something you very rarely want to do if you are shooting for a predictive model. If you knew that every winter the price of fuel went up for home/office heating but decided to factor that out -- you'd get eaten alive by traders who would gladly take your money.

Dummy Data

The trick here is you don't know if your dummy data is going to look similar enough to the real thing. The solution to that, however, is a different question.

"use R libraries within Python..."

Take a look at rpy2
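A minimal sketch of what that looks like, assuming R and its forecast package are installed (this snippet is illustrative, not something from the discussion above):

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

# importr exposes R's forecast package; R's auto.arima becomes auto_arima
forecast = importr('forecast')

# wrap the sales column as a monthly R ts object and fit auto.arima on it
r_ts = robjects.r['ts'](robjects.FloatVector(df['sales_index']), frequency=12)
fit = forecast.auto_arima(r_ts)
print(robjects.r['summary'](fit))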

ARMA and ARIMA

... are not likely to yield fruitful results, in part because the error term is assumed to be IID while markets have a bias. Exponential smoothing works well because it substitutes an exponentially weighted average of recent observations, letting the latest data dominate the forecast.

"But when I run it against sales_index..."

The sales_index is going to be composed of several independent variables: production of meat (oddly, with an inverse correlation with rainfall) and the exchange rate, as you would expect. But not included in your model is data for production in other countries (100% pure organic, grass-fed, 'AAA' Alberta beef), taxes, domestic beef production, shipping, etc. And that is why you probably don't want to go up against professional traders who have both quants and serious domain expertise. Finally, I'd note there is nothing timewise that would indicate whether this was normal or not.

[plot: sales_index over time with trend swings]

There seems to be a 1.5-year upswing from 2015-2017, followed by a downswing until 2019. That appears to be repeating, but the sudden change in 2020 leads one to believe this is an aberration. The dataset is so small, though, that it's hard to judge any validity. Trend lines or channels would work better in this case.

hrokr

You can add additional features to your ARMA model using the optional `exog` argument when you initialize the model and when you make predictions.

For example, to add a handful of your features:

# initialize the model
model = ARMA(train['sales_index'], 
             exog=train[['slg_adt_ctl', 'slg_bbs', 'retail_sales_index']],
             order=(2,0))

model_fit = model.fit() # fit the model

# make predictions
predictions = model_fit.predict(start=len(train), 
                                end=len(train)+len(test)-1, 
                                exog=test[['slg_adt_ctl', 'slg_bbs', 'retail_sales_index']], 
                                dynamic=False)

And when we produce the predictions plot, we now get some additional predictive power.
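For reference, a minimal plotting sketch mirroring the code from the question (not part of the original answer):

plt.figure(figsize=(12,6))
plt.plot(test['sales_index'], label='actual')
plt.plot(predictions, color='red', label='forecast')
plt.legend(loc='best')
plt.show()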

[plot: ARMA predictions with exogenous features vs. actual]

user45392
  • How about a scikit-learn approach? As you can see from the above plot, the prediction output is not very good. Any possible optimal attempt to get a better prediction? Any updated, extended attempt? Thanks – Hamilton Aug 21 '20 at 14:06
  • Your question (before editing) asked how to make a "prediction by using multiple features from the monthly time series". How well the model predicts is out of scope for that question. – user45392 Aug 22 '20 at 03:29
  • Right, but I'm explicitly seeking a possible attempt using `scikit-learn`; I bet you might have a possible approach with `scikit-learn`. Thanks for the heads up :) – Hamilton Aug 22 '20 at 05:26