22

I have many data sets with known outliers (big orders).

data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2",
                 "10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4",
                 "13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1",
                 155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
                 135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
                 222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
                 231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8,
                 254776725.6, 329429882.8, 264012891.6, 496745973.9, 284484362.55),
               ncol=2, byrow=FALSE)

The top 11 outliers of this specific series are:

outliers <- matrix(c("14Q4","14Q2","12Q1","13Q1","14Q2","11Q1","11Q4","14Q2","13Q4","14Q4","13Q1",
                     20193525.68, 18319234.7, 12896323.62, 12718744.01, 12353002.09, 11936190.13,
                     11356476.28, 11351192.31, 10101527.85, 9723641.25, 9643214.018),
                   ncol=2, byrow=FALSE)

What methods are there that I can use to forecast the time series while taking these outliers into consideration?

I have already tried replacing the outliers with the next-biggest value (running the data set 10 times, replacing one more outlier each run, until the 10th run has all the outliers replaced). I have also tried simply removing the outliers (again running the data set 10 times, removing one more outlier each run, until all 10 are removed in the 10th run).
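Roughly, the replacement runs look like this (a toy sketch with made-up numbers, not my actual data or code):

```r
# Toy sketch of the "replace with the next-biggest value" runs described above.
series         <- c(100, 120, 500, 130, 110, 400, 125)  # made-up data
known_outliers <- c(500, 400)                            # biggest first

runs <- lapply(seq_along(known_outliers), function(k) {
  adjusted <- series
  for (o in known_outliers[1:k]) {
    idx <- which(adjusted == o)[1]
    # replace the outlier with the biggest remaining "normal" value
    adjusted[idx] <- max(series[!series %in% known_outliers])
  }
  adjusted
})
runs[[2]]  # both outliers replaced by 130
```

Each element of runs is one pass of the data with one more outlier replaced.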

I just want to point out that removing these big orders does not delete the data point completely, as there are other deals that happen in that quarter.

My code tests the data through multiple forecasting models (ARIMA weighted on the out-of-sample, ARIMA weighted on the in-sample, ARIMA weighted, ARIMA, additive Holt-Winters weighted and multiplicative Holt-Winters weighted), so it needs to be something that can be adapted to these multiple models.

Here are a couple more data sets that I used; I do not have the outliers for these series yet, though.

data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2",
                 "10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4",
                 "13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3",
                 26393.99306, 13820.5037, 23115.82432, 25894.41036, 14926.12574, 15855.8857,
                 21565.19002, 49373.89675, 27629.10141, 43248.9778, 34231.73851, 83379.26027,
                 54883.33752, 62863.47728, 47215.92508, 107819.9903, 53239.10602, 71853.5,
                 59912.7624, 168416.2995, 64565.6211, 94698.38748, 80229.9716, 169205.0023,
                 70485.55409, 133196.032, 78106.02227),
               ncol=2, byrow=FALSE)

data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2",
                 "10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4",
                 "13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3",
                 3311.5124, 3459.15634, 2721.486863, 3286.51708, 3087.234059, 2873.810071,
                 2803.969394, 4336.4792, 4722.894582, 4382.349583, 3668.105825, 4410.45429,
                 4249.507839, 3861.148928, 3842.57616, 5223.671347, 5969.066896, 4814.551389,
                 3907.677816, 4944.283864, 4750.734617, 4440.221993, 3580.866991, 3942.253996,
                 3409.597269, 3615.729974, 3174.395507),
               ncol=2, byrow=FALSE)

If this is too complicated, then I would appreciate an explanation of how, once outliers are detected using certain R commands, the data is typically dealt with before forecasting (e.g. smoothing), and how I can approach writing that code myself (not using the commands that detect outliers).

  • This question is more about statistics than about programming. Can you move this to Cross Validated? – forecaster Apr 19 '15 at 12:47
  • Is your last observation correct? It seems to be off by a factor of 10 and has a different format. – J.R. Apr 21 '15 at 07:39
  • Yes, sorry, I will edit it now – Summer-Jade Gleek'away Apr 21 '15 at 09:16
  • How do you know what points are outliers? You mention all these weighted methods; do you mean that you want to downweight the known outliers by some fixed amount that you have determined using other methods? Or would you consider a model that provides a level of smoothing and thus "ignores" outliers without being told which ones they are? – konvas Apr 21 '15 at 10:54
  • We have a file which contains all the 'big orders', so currently we take the top 11 and run the series 11 times, replacing the biggest outlier with the next biggest, or just remove them completely. We were trying to see if there were other ways of doing it, as the results were never quite accurate to the actual value achieved that quarter. Any methods are helpful. If downweighting the outliers is possible then yes, I'd like to try that too – Summer-Jade Gleek'away Apr 21 '15 at 11:42
  • Just to clarify, are you trying to forecast your 'data' series or your 'data' series - 'outliers' series or something else? – WaltS Apr 22 '15 at 13:17
  • I am trying to forecast the data series. The first one I have given has outliers that are currently known, which is the sort of method that I would like help with. The other two data series are just to show that the outliers aren't always seasonal, but I currently don't have access to the outliers that are known in those series. Does this answer your question? – Summer-Jade Gleek'away Apr 23 '15 at 08:07
  • 1. I don't understand on what grounds you call them outliers. As konvas mentioned, they tend to be Q4 and Q1, and there may be a reason. You may try with monthly data, as detailed data can show a pattern that fails to be identified by aggregation. – Jaehyeon Kim Apr 24 '15 at 11:22
  • 2. I'm not sure why you use the original scale for ARIMA modelling. Maximum likelihood estimation basically assumes the normal distribution. Although ML estimators are asymptotically normal, the number of records in your data is too low. In econometrics, this type of variable is converted into rates of change. At the least, you'd have to log-transform the records; some of the outliers may then no longer be an issue. – Jaehyeon Kim Apr 24 '15 at 11:22
  • 3. Smoothing techniques may be better as the number of records is so small. Holt-Winters, exponential smoothing, moving averages and the like would be better than ARIMA modelling. Still, however, I highly recommend using monthly data, and I'm sure you are able to obtain them. 4. In conclusion, it is not clear that your data has many outliers, but the models and data may have to be revised. – Jaehyeon Kim Apr 24 '15 at 11:22

3 Answers

6

Your outliers appear to be seasonal variations, with the largest orders appearing in the fourth quarter. Many of the forecasting models you mentioned include the capability for seasonal adjustments. As an example, the simplest model could have a linear dependence on year with corrections for all seasons. The code would look like:

df <- data.frame(period= c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3",
                       "10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2",
                       "13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1"),
                 order= c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
                        135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
                        222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
                        231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6,
                        329429882.8, 264012891.6, 496745973.9, 42748656.73))

seasonal <- data.frame(year=as.numeric(substr(df$period, 1,2)), qtr=substr(df$period, 3,4), data=df$order)
ord_model <- lm(data ~ year + qtr, data=seasonal)
seasonal <- cbind(seasonal, fitted=ord_model$fitted)
library(reshape2)
library(ggplot2)
plot_fit <- melt(seasonal,id.vars=c("year", "qtr"), variable.name = "Source", value.name="Order" )
ggplot(plot_fit, aes(x=year, y = Order, colour = qtr, shape=Source)) + geom_point(size=3)

which gives the results shown in the chart below (linear fit with seasonal adjustments).

Models with a seasonal adjustment but nonlinear dependence upon year may give better fits.
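For example, making the trend quadratic in year while keeping the seasonal dummies is a one-term change to the formula. A sketch on simulated data (the numbers below are invented purely to illustrate the comparison):

```r
# Sketch: linear vs. quadratic year trend with the same seasonal dummies,
# using simulated quarterly data with a deliberately curved trend.
set.seed(1)
seasonal <- data.frame(
  year = rep(8:14, each = 4),
  qtr  = rep(c("Q1", "Q2", "Q3", "Q4"), times = 7)
)
seasonal$data <- 1e6 * seasonal$year^2 +          # curved trend
  5e7 * (seasonal$qtr == "Q4") +                  # strong 4th-quarter effect
  rnorm(nrow(seasonal), sd = 1e5)                 # noise

lin  <- lm(data ~ year + qtr, data = seasonal)
quad <- lm(data ~ poly(year, 2) + qtr, data = seasonal)
AIC(lin, quad)  # the quadratic model should score clearly better here
```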

WaltS
4

The approach you are trying to use to cleanse your data of outliers is not going to be robust enough to identify them. I should add that there is a free outlier package in R called tsoutliers, but it won't do the things I am about to show you.

You have an interesting time series here. The trend changes over time, with the upward trend weakening a bit. If you bring in two time-trend variables, the first beginning at period 1 and another beginning at period 14 and running forward, you will capture this change. As for seasonality, you can capture the high 4th quarter with a dummy variable. The model is parsimonious, as the other 3 quarters are not different from the average, plus there is no need for an AR12, seasonal differencing or 3 seasonal dummies. You can also capture the impact of the last two observations being outliers with two dummy variables. Ignore the 49 above the word "trend", as that is just the name of the series being modeled. (Chart: actual, fit and forecasts with confidence limits.)
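A rough R translation of that regressor structure (the breakpoint at period 14 and the pulse positions for the last two observations are taken from the description above, so treat the exact positions as assumptions):

```r
# Sketch of the deterministic regressors described above, for 29 quarters.
n   <- 29
t1  <- 1:n                       # trend beginning at period 1
t2  <- pmax(0, t1 - 13)          # second trend beginning at period 14
q4  <- as.integer(t1 %% 4 == 0)  # 4th-quarter dummy (series starts in 08Q1)
p28 <- as.integer(t1 == 28)      # pulse dummy for the outlier at 14Q4
p29 <- as.integer(t1 == 29)      # pulse dummy for the outlier at 15Q1
# With y <- as.numeric(data[,2]) from the question, the fit would then be:
# fit <- lm(y ~ t1 + t2 + q4 + p28 + p29)
```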

Tom Reilly
  • The result seems nice, but you're not actually giving the solution/algorithm you used... – patapouf_ai Apr 19 '15 at 11:23
  • The Box-Jenkins process didn't search for changes in trend, level, parameters, variance or outliers, but you need to do that to identify the patterns. Not all models will have ARIMA in them; some rely upon deterministic variables (step-up regression, if you will). See here for more on the process http://bit.ly/18AGPES All models are wrong and some are useful. You might find this discussion useful http://bit.ly/1Q5BWWs – Tom Reilly Apr 19 '15 at 15:23
  • Could you please give the R code that you used? I can't quite follow the process – Summer-Jade Gleek'away Apr 21 '15 at 09:39
  • I didn't use R code. Is that mandatory? If you want to dig deeper take a look at Tsay's paper Outliers, level shifts, and variance changes in time series www.unc.edu/~jbhill/tsay.pdf and Balke's paper Detecting Level Shifts in Time Series http://bit.ly/1yLwLW5 – Tom Reilly Apr 21 '15 at 11:37
  • the method that i am looking for does need to be adapted to an R code. – Summer-Jade Gleek'away Apr 23 '15 at 08:08
  • It seems that you are constrained in your options. For the 2nd data set, the 20th period is an outlier. Seasonal differencing with one outlier, but the outlier also has seasonal differencing: [(1-B**4)]Y(T) = 15729. + [X1(T)][(1-B**4)][(+ 29904. )] :PULSE 20 + [A(T)] – Tom Reilly Apr 23 '15 at 13:40
  • For the 3rd data set, 17 is an outlier. Starting at period 7 (3rd quarter) shows that it is systematically lower than the other quarters and is identified as a "seasonal pulse", or "a change in the seasonality", from that point on. – Tom Reilly Apr 23 '15 at 13:47
  • 3rd data set model. An AR1, level shift, outlier, change in level and a seasonal pulse Y(T) = 3253.6 +[X1(T)][(+ 1044.0 )] :LEVEL SHIFT 8 +[X2(T)][(- 651.29 )] :SEASONAL PULSE 7 +[X3(T)][(+ 1021.8 )] :PULSE 17 + [(1- .627B** 1)]**-1 [A(T)] – Tom Reilly Apr 23 '15 at 13:47
4

You already said you tried different ARIMA models but, as mentioned by WaltS, your series don't seem to contain big outliers so much as a seasonal component, which is nicely captured by auto.arima() in the forecast package:

library(forecast)
myTs <- ts(as.numeric(data[,2]), start=c(2008, 1), frequency=4)
myArima <- auto.arima(myTs, lambda=0)
myForecast <- forecast(myArima)
plot(myForecast)


where the lambda=0 argument tells auto.arima() to apply a Box-Cox transformation with lambda = 0 (equivalent to taking the log of the data), which accounts for the increasing amplitude of the seasonal component.
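If you did want to pre-clean outliers first, the same package also provides tsclean(), which flags outliers from a decomposition of the series and replaces them with interpolated values; a sketch on toy data:

```r
library(forecast)

# Toy series with two artificial spikes, cleaned before modelling.
set.seed(1)
x <- ts(100 + 1:40 + rnorm(40), frequency = 4)
x[c(12, 30)] <- x[c(12, 30)] + 50   # inject two outliers
cleaned <- tsclean(x)               # detect and replace them
plot(forecast(auto.arima(cleaned)))
```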

J.R.