seasonal decomposition on non-uniform spaced time series, any well-established algo in R or Python?

Question

stats package in R has stl(), but it requires a uniformly spaced time series created by ts(). It can't deal with zoo objects.

Strangely, it can't deal with missing values either, although STL method claims to be able to fill in missing value with LOESS. (see this question on CV.)

So for example, if you have business day data, you can't just make it calendar day by putting NA on weekends and call stl().

I also see the author of Python statsmodel trying to migrate stl() to work with Pandas TimeSeries, but it doesn't seem to be there yet.

Thanks

Edit: Just to add that I know I can just do a very simple model like fitting harmonics, but I want a well-established model to at least provide benchmark. I have sub monthly data, so X12 is not applicable.

score 1 · Answer 1 · edited May 23 '17 at 12:25

According to @Julius in this post is it possible to use stl with na.approx, from zoo package, using stl(x, na.action = na.approx, ...). This does some sort of interpolation.

Unfortunately stl prefers regular time series.

So I take @Julius approach to test how well loess performs in filling NAs assuming a irregular time series. Apparently the error is very small.

Lets simulate

The following function takes the data and simulate scenarios with 3^(0:p) NAs with naapprox=TRUE for using na.approx and optspan=TRUE to optimize span parameter. The loss function is MAPE.

#The approach is based in @Julius stl with NAs
library(zoo)
library(plyr)
library(reshape)
library(ggplot2)
mape <- function(f, x) colMeans(abs(1 - f / x) * 100,na.rm=T)

loessCheck <- function(data,p=2, naapprox=TRUE, optspan=TRUE){
  set.seed(20130201)
  pos <- lapply(3^(0:p), function(x) sample(1:length(data), x))
  datasetsNA <- lapply(pos, function(x) {data[x] <- NA; data})
  original <- data.frame(y.predict=as.numeric(data))
  original$id <- "Original"
  datasetsNA <- lapply(datasetsNA, function(y){ 
    posna=which(is.na(y)) 
    tvo=1:length(y) 
    yo=y;tv=tvo
    if (any(posna%in%c(1,length(y)))) tv=tvo[-posna[which(posna%in%c(1,length(y)))]]
    if(naapprox) y=na.approx(y) #instead of inside loess
    if(optspan){ fseq=function(x, y)
      mape(matrix(predict(loess(y ~ dt, span=x, data.frame(dt=tv, y=y))),ncol=1),as.vector(y[!is.na(y)]))
    ospan <- optimize(fseq, c(0.1,1), maximum=FALSE,y=y)
    spanmim <- ospan$minimum
    } else spanmim <- 0.75
    y.loess <- loess(y ~ dt, span=spanmim, data.frame(dt=tv, y=y))
    y.predict <- predict(y.loess, data.frame(dt=tvo))
    y.predict[-posna] <- yo[-posna]
    data.frame(y.predict, 
               id = paste(length(posna), "NAs"), 
               stringsAsFactors = FALSE)
    })
  loessAll <- rbind.fill(c(list(original), datasetsNA))
  loessAll$Date <- time(data)
  results <- data.frame(y.predict = sapply(lapply(datasetsNA, '[', i = "y.predict"), mape, original[, "y.predict"]))
  results$id <- unique(loessAll$id[-(1:nrow(original))])
  results <- melt(results, id.var = "id")
  results$x <- min(loessAll$Date) + diff(range(loessAll$Date)) / 4
  results$y <- min(original[, "y.predict"],na.rm=T) + diff(range(original[, "y.predict"],na.rm=T)) / (4 * p) * (0:p)
  results$value <- round(results$value, 2)
  ggplot(loessAll, aes(x = Date, y = y.predict, colour = id, group = id)) + geom_line() + 
     theme_bw() +
    theme(legend.title = element_blank(), strip.background = element_rect(fill = "white")) + 
    labs(x = NULL, y = NULL) + scale_colour_brewer(palette = "Set1") +
    lapply(unique(results$id), function(z)
      geom_text(data = results, colour = "black", size = 3,
                aes(x = x, y = y, label = paste0("MAPE (", id, "): ", value, "%"))))
}
args(loessCheck) # first boolean for using na.approx, the second for optimize span parameter in sample


loessCheck(nottem,p=3) #T T

enter image description here