
I am trying to fit a regression model to time series data in Python (basically to predict the trend). I have already applied seasonal decomposition using statsmodels, which splits the data into three components, including the trend. However, I would like to know how to find the best fit to my data using statistical regression (by defining arbitrary functions), and how to check the sum of squares so that I can compare several models and select the one that fits best. I should mention that I am not looking for learning-based regressions that rely on training/testing data. I would appreciate any help with this, or a pointer to a tutorial.

NJE
  • What's wrong with scikit learn? https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html – K. Shores Jul 31 '20 at 00:15
  • @K. Shores Thanks for your comment. I checked the link...but it seems this is only for linear regression? I would like to examine various regression models...like polynomial, exponential, sinusoidal and even their combinations. – NJE Jul 31 '20 at 01:16
  • I'm pretty sure scikit learn has options for all of those. Search their documentation – K. Shores Jul 31 '20 at 15:55
  • Thank you so much! – NJE Jul 31 '20 at 22:45

1 Answer


Since you mentioned:

I would like to know how I can come up with the best fit to my data using statistical-based regressions (by defining any functions) and check the sum of squares to compare various models and select the best one which fits my data. I should mention that I am not looking for learning-based regressions which rely on training/testing data.

One option is an ARIMA (AutoRegressive Integrated Moving Average) model with a given order (p, d, q), which is fitted on the history and then used to predict()/forecast(). Note that the data is split into train and test sets here only for the sake of evaluation, using walk-forward validation:

from pandas import read_csv
from datetime import datetime  # pandas.datetime was deprecated in pandas 1.0
from matplotlib import pyplot
# import path for statsmodels 0.10.2; from 0.13 on, ARIMA lives in statsmodels.tsa.arima.model
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
# load dataset
def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')
series = read_csv('/content/shampoo.txt', header=0, index_col=0, parse_dates=True, squeeze=True, date_parser=parser)
series.index = series.index.to_period('M')
# split into train and test sets
X = series.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
# walk-forward validation
for t in range(len(test)):
    model = ARIMA(history, order=(5,1,0))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))
# evaluate forecasts
rmse = sqrt(mean_squared_error(test, predictions))
rmse_ = 'Test RMSE: %.3f' % rmse

# plot forecasts against actual outcomes
pyplot.plot(test, label='test')
pyplot.plot(predictions, color='red', label='predict')
pyplot.xlabel('Months')
pyplot.ylabel('Sale')
pyplot.title(f'ARIMA model performance with {rmse_}')
pyplot.legend()
pyplot.show()

I used the same library you mentioned (statsmodels), in the version shown below. The resulting plot includes the Root Mean Square Error (RMSE) on the test set:

import statsmodels as sm
sm.__version__ # '0.10.2'

[Image: plot of test values against ARIMA predictions, titled with the test RMSE]

Please see post1 and post2 for further information. You could also add a trend line to the plot.
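If you would rather fit functions you define yourself and compare them by sum of squares, as the question asks, one possible approach is scipy.optimize.curve_fit. The following is only a minimal sketch and is not part of the ARIMA code above: the three candidate functions, the helper name fit_and_score, the starting values, and the synthetic data (with time rescaled to [0, 1]) are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical candidate trend functions; replace them with shapes that suit your data.
def linear(t, a, b):
    return a * t + b

def quadratic(t, a, b, c):
    return a * t**2 + b * t + c

def exponential(t, a, b, c):
    return a * np.exp(b * t) + c

def fit_and_score(func, t, y, p0=None):
    """Least-squares fit of func to (t, y); returns the parameters and the residual sum of squares."""
    params, _ = curve_fit(func, t, y, p0=p0, maxfev=10000)
    residuals = y - func(t, *params)
    return params, float(np.sum(residuals**2))

# Synthetic series for illustration only (not the shampoo data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 36)  # time rescaled to [0, 1] to keep exp() well behaved
y = 80.0 * t**2 + 20.0 * t + 10.0 + rng.normal(scale=1.0, size=t.size)

# Each entry: (function, initial parameter guess or None for curve_fit's default).
candidates = {
    "linear": (linear, None),
    "quadratic": (quadratic, None),
    "exponential": (exponential, (1.0, 1.0, 0.0)),
}

results = {}
for name, (func, p0) in candidates.items():
    params, sse = fit_and_score(func, t, y, p0)
    results[name] = (params, sse)
    print(f"{name}: SSE = {sse:.2f}")

# Pick the candidate with the smallest residual sum of squares.
best = min(results, key=lambda name: results[name][1])
print("lowest SSE:", best)
```

The fitted curve of the chosen model, func(t, *params), can then be plotted over the data as a trend line. Keep in mind that the sum of squares alone tends to favor models with more parameters, so a penalized criterion such as AIC may be a fairer comparison.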

Mario