18

I am trying to detect anomalous values in a time series of climatic data with some missing observations. Searching the web I found many available approaches. Of those, stl decomposition seems appealing, in the sense of removing trend and seasonal components and studying the remainder. Reading STL: A Seasonal-Trend Decomposition Procedure Based on Loess, stl appears to be flexible in determining the settings for assigning variability, unaffected by outliers and possible to apply despite missing values. However, trying to apply it in R, with four years of observations and defining all the parameters according to http://stat.ethz.ch/R-manual/R-patched/library/stats/html/stl.html , I encounter error:

time series contains internal NAs

when na.action = na.omit, and

series is not periodic or has less than two periods

when na.action = na.exclude.

I have double checked that the frequency is correctly defined. I have seen relevant questions in blogs, but didn't find any suggestion that could solve this. Is it not possible to apply stl in a series with missing values? I am very reluctant to interpolate them, as I do not want to be introducing (and consequently detecting...) artifacts. For the same reason, I do not know how advisable it would be to use ARIMA approaches instead (and if missing values would still be a problem).

Please share if you know a way to apply stl in a series with missing values, or if you believe my choices are methodologically not sound, or if you have any better suggestion. I am quite new in the field and overwhelmed by the heaps of (seemingly...) relevant information.

Julius Vainora
  • 47,421
  • 9
  • 90
  • 102
effie pav
  • 181
  • 1
  • 1
  • 4
  • How many missing values do you have? You could try the `na.approx` function from the `zoo` package to interpolate your missing data. – Fernando Dec 04 '12 at 20:20
  • Some extra thoughts: apparently, now the package `stlplus` can handle missing values in ts decomposition. `Rbeast` may be another alternative; it is a Bayesian algorithm, Unlike `stl` that only decomposes time series, Rbeast does time series decomposition and changepoint detection at the same time, with missing values allowed. Here is an example: `library(Rbeast)`; `co2[ sample(1:length(co2), 200) ]=NA`; `plot(beast(co2))`. – zhaokg Jan 16 '22 at 01:46

2 Answers2

21

In the beginning of stl we find

x <- na.action(as.ts(x))

and soon after that

period <- frequency(x)
if (period < 2 || n <= 2 * period) 
    stop("series is not periodic or has less than two periods")

That is, stl expects x to be ts object after na.action(as.ts(x)) (otherwise period == 1). Let us check na.omit and na.exclude first.

Clearly, at the end of getAnywhere("na.omit.ts") we find

if (any(is.na(object))) 
    stop("time series contains internal NAs")

which is straightforward and nothing can be done (na.omit does not exclude NAs from ts objects). Now getAnywhere("na.exclude.default") excludes NA observations, but returns an object of class exclude:

    attr(omit, "class") <- "exclude"

and this is a different situation. As mentioned above, stl expects na.action(as.ts(x)) to be ts, but na.exclude(as.ts(x)) is of class exclude.

Hence if one is satisfied with NAs exclusion then e.g.

nottem[3] <- NA
frequency(nottem)
# [1] 12
na.new <- function(x) ts(na.exclude(x), frequency = 12)
stl(nottem, na.action = na.new, s.window = "per")

works. In general, stl does not work with NA values (i.e. with na.action = na.pass), it crashes deeper in Fortran (see full source code here):

z <- .Fortran(C_stl, ...

Alternatives to na.new are not delightful:

  • na.contaguous - finds the longest consecutive stretch of non-missing values in a time series object.
  • na.approx, na.locf from zoo or some other interpolation function.
  • Not sure about this one, but another one Fortran implementation can be found for Python here. One could use Python of possibly install R from source after some modifications, in case this module really allows missing values.

As we can see in the paper, there is no some simple procedure for missing values (like approximating them in the very beginning) which could be applied to the time series before calling stl. So considering the fact that original implementation is quite lengthy I would think about some other alternatives than whole new implementation.

Update: a quite optimal in many aspects choice when having NAs could be na.approx from zoo, so let us check its performance, i.e. compare results of stl with full data set and results when having some number of NAs, using na.approx. I am using MAPE as a measure of accuracy, but only for trend, because seasonal component and remainder crosses zero and it would distort the result. Positions for NAs are chosen at random.

library(zoo)
library(plyr)
library(reshape)
library(ggplot2)
mape <- function(f, x) colMeans(abs(1 - f / x) * 100)

stlCheck <- function(data, p = 3, ...){
  set.seed(20130201)
  pos <- lapply(3^(0:p), function(x) sample(1:length(data), x))
  datasetsNA <- lapply(pos, function(x) {data[x] <- NA; data})
  original <- data.frame(stl(data, ...)$time.series, stringsAsFactors = FALSE)
  original$id <- "Original"
  datasetsNA <- lapply(datasetsNA, function(x) 
    data.frame(stl(x, na.action = na.approx, ...)$time.series, 
               id = paste(sum(is.na(x)), "NAs"), 
               stringsAsFactors = FALSE))
  stlAll <- rbind.fill(c(list(original), datasetsNA))
  stlAll$Date <- time(data)
  stlAll <- melt(stlAll, id.var = c("id", "Date"))
  results <- data.frame(trend = sapply(lapply(datasetsNA, '[', i = "trend"), mape, original[, "trend"]))
  results$id <- paste(3^(0:p), "NAs")
  results <- melt(results, id.var = "id")
  results$x <- min(stlAll$Date) + diff(range(stlAll$Date)) / 4
  results$y <- min(original[, "trend"]) + diff(range(original[, "trend"])) / (4 * p) * (0:p)
  results$value <- round(results$value, 2)
  ggplot(stlAll, aes(x = Date, y = value, colour = id, group = id)) + geom_line() + 
    facet_wrap(~ variable, scales = "free_y") + theme_bw() +
    theme(legend.title = element_blank(), strip.background = element_rect(fill = "white")) + 
    labs(x = NULL, y = NULL) + scale_colour_brewer(palette = "Set1") +
    lapply(unique(results$id), function(z)
      geom_text(data = results, colour = "black", size = 3,
                aes(x = x, y = y, label = paste0("MAPE (", id, "): ", value, "%"))))
}

nottem, 240 observations

stlCheck(nottem, s.window = 4, t.window = 50, t.jump = 1)

enter image description here

co2, 468 observations

stlCheck(log(co2), s.window = 21)

enter image description here

mdeaths, 72 observations

stlCheck(mdeaths, s.window = "per")

enter image description here

Visually we do see some differences in trend in cases 1 and 3. But these differences are pretty small in 1 and also satisfactory in 3 considering sample size (72).

Julius Vainora
  • 47,421
  • 9
  • 90
  • 102
  • Right, it seems the implementation of `stl` in R is based on the Fortran implementation discussed in Section 4 (page 20) of the paper linked in the question, which eschews design goal 4 (page 5) in favor of speed (see bottom of page 21). I was hoping for a solution that does allow for missing values (an implementation of which apparently exists for S). Although I suspect that in most cases the differences between passing input through `na.approx` before `stl` wouldn't differ too greatly from an implementation of stl that handles missing values, and that is indeed my fallback option. – user1935457 Jan 31 '13 at 01:16
  • @user1935457, I updated my answer. Clearly the implementation of `stl` in R lacks some parts, but as I mentioned in my answer, it might not be worth rewriting it, especially if it really would not differ much as you said. Just thought about some tests with full data and with `na.approx`, I will update my answer with results.. – Julius Vainora Jan 31 '13 at 23:17
9

Realize this is an old question, but thought I'd update since there is a newer stl package available in R called stlplus. Here is its homepage on github. You can install it from CRAN with install.packages("stlplus") or directly from github with devtools::install_github("hafen/stlplus").

sfjac
  • 7,119
  • 5
  • 45
  • 69