I have monthly data on website clicks and want to build a SARIMA model to predict the expected clicks for the next month. Because a SARIMA model needs stationary data, I transformed the series and ran the Augmented Dickey-Fuller (ADF) test in Python to detect when I can stop transforming and start feeding the data to the model (i.e., when the p-value < 0.05).
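To make the stopping rule concrete, this is the kind of check I mean (a minimal sketch; is_stationary and the alpha threshold are my own placeholder names, and adfuller runs with its defaults):

from statsmodels.tsa.stattools import adfuller

def is_stationary(series, alpha=0.05):
    # adfuller returns (statistic, p-value, usedlag, nobs, critical values, icbest)
    statistic, pvalue, usedlag, nobs, crit, icbest = adfuller(series.dropna())
    return pvalue < alpha  # rejecting the unit-root null -> treat as stationary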
Since the data is seasonal, do I need to set the maxlag parameter in adfuller() to 12, and why or why not?
I ran the ADF test with both settings:
- the default maxlag (see the sketch after this list for what the default actually is)
- maxlag=12
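For reference, if I read the statsmodels documentation and source correctly, the default maxlag follows Schwert's rule of thumb based on the sample size, so for a short monthly series it is usually close to, but not necessarily, 12:

import numpy

nobs = len(myTimeSeries)
# Schwert's rule, as used by adfuller; statsmodels may cap this further for very short series
default_maxlag = int(numpy.ceil(12.0 * (nobs / 100.0) ** 0.25))
print(default_maxlag)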
Of course, I get different p-values from the two variants:
import numpy
from statsmodels.tsa.stattools import adfuller

# myTimeSeries is a pandas Series with a monthly DatetimeIndex
myTimeSeries.plot()
adfuller(myTimeSeries) # p=0.113872
adfuller(myTimeSeries, maxlag=12) # p=0.996884
myLog = numpy.log(myTimeSeries)  # log transform
myLog.plot()
adfuller(myLog) # p=0.165395
adfuller(myLog, maxlag=12) # p=0.997394
myDiff = myLog.diff(1)  # first difference (lag 1)
myDiff.plot()
myDiff = myDiff.dropna()
adfuller(myDiff) # p=0.003884
adfuller(myDiff, maxlag=12) # p=0.613816
mySeasonalDiff = myDiff.diff(12)  # seasonal difference (lag 12)
mySeasonalDiff.plot()
mySeasonalDiff = mySeasonalDiff.dropna()
adfuller(mySeasonalDiff) # p=0.000000
adfuller(mySeasonalDiff, maxlag=12) # p=0.958532
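If it matters for the answer: my understanding from the statsmodels docs is that with the default autolag='AIC', maxlag is only the upper bound of the lag search, whereas autolag=None uses exactly maxlag lags. To see which lag the test actually picks, one can unpack the full return value, e.g.:

# autolag='AIC' (the default): maxlag only bounds the search, usedlag is the chosen lag
statistic, pvalue, usedlag, nobs, crit, icbest = adfuller(myDiff, maxlag=12)
print(pvalue, usedlag)

# autolag=None: exactly maxlag lags are used (note: one fewer return value, no icbest)
statistic, pvalue, usedlag, nobs, crit = adfuller(myDiff, maxlag=12, autolag=None)
print(pvalue, usedlag)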
It looks as if, with maxlag=12, I would need to transform my data further, whereas with the default maxlag I can stop after taking the log and the first difference. So I would like to know how to use the ADF test properly.
Thanks for your help.