
I’m trying to determine how to set the span argument for geom_smooth() based on meaningful units from my data. As an example, let’s say I have a daily time series, with lower values on weekends (see bottom of post for data and code):

I’d like to smooth over a 7-day window, to smooth out the periodic dips due to weekends, but otherwise maximize resolution of the smoothed line — similar to a 7-day moving average.

My question is: how do I translate something like "7 days" into the correct value for span?

According to this SO answer, span sets the alpha parameter for the loess regression. The answer quotes Jacoby, 2000:

alpha gives the proportion of observations that is to be used in each local regression. Accordingly, this parameter is specified as a value between 0 and 1. The alpha value used for the loess curve in Fig. 2 is 0.65; so, each of the local regressions used to produce that curve incorporates 65% of the total data points.

Based on this, I tried setting span based on days per week (7) divided by the number of days in the data (nrow(mydata)):

library(ggplot2)

ggplot(mydata, aes(date, value)) +
  geom_point() +
  geom_smooth(se = FALSE, span = 7 / nrow(mydata))
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

But this doesn't smooth out the weekend dips:
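
One way to see why `span = 7 / nrow(mydata)` leaves the dips in place is to look at the weights inside a 7-point neighbourhood. The sketch below is base R only and assumes the textbook loess setup (tricube kernel, bandwidth equal to the distance to the farthest of the k nearest points); the exact neighbourhood rule inside `stats::loess()` may differ slightly, so treat the numbers as illustrative. `tricube()` is a helper defined here, not a library function.

```r
# Textbook tricube kernel: weight is 1 at the target and falls to 0 at distance h
tricube <- function(d, h) ifelse(d < h, (1 - (d / h)^3)^3, 0)

n    <- 92        # days from 2020-01-01 to 2020-04-01 inclusive
k    <- 7         # desired neighbourhood size, in days
span <- k / n     # the value tried above

# Distances (in days) from an interior target day to its 7 nearest neighbours
d <- abs(-3:3)
h <- max(d)       # assumed bandwidth: distance to the farthest neighbour
w <- tricube(d, h)

# Relative weight of each day in the local fit (the outermost two are zero)
round(w / sum(w), 3)
```

Under these assumptions the two outermost days get zero weight and the centre day gets the most, so the fit behaves more like a 5-day weighted window than a flat 7-day average. The local quadratic that `loess` fits by default can also bend to follow what remains of the weekly dip, so a noticeably larger span would likely be needed before the weekends average out.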

Data:

library(tidyverse)
library(lubridate)
set.seed(1)

mydata <- tibble(
    date = seq(ymd("2020-01-01"), ymd("2020-04-01"), by = 1)
  ) %>%
  mutate(
    value = if_else(
      weekdays(date) %in% c("Saturday", "Sunday"),
      rnorm(n(), 10, 3),            # lower values on weekends
      rnorm(n(), 50, 10)
    ),
    value = if_else(
      date > ymd("2020-02-15"),
      value + rnorm(n(), 20, 2),    # stepwise increase after Feb 15
      value
    )
  )

ggplot(mydata, aes(date, value)) +
  geom_point() 
zephryl

1 Answer


Here is an alternative workflow you might consider: convert your data to a time series object using the tsibble package. Rob Hyndman has an excellent book with details on how to do it.

Here is the default loess.

ggplot(mydata, aes(date, value)) +
  geom_point() + geom_smooth(se = FALSE)


Now I'm converting to a time series.

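library(tsibble)  # as_tsibble()
library(feasts)   # STL() and autoplot() for tsibbles; also attaches fabletools
mydata_tsibble <- as_tsibble(mydata, index = date)
autoplot(mydata_tsibble, value)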


You can use a decomposition to see the different parts of the time series. Roughly speaking, the trend-cycle is like a moving average, the seasonal component is the average for each season within the de-trended series, and the remainder is what is left over. (This describes classical decomposition more than STL.)

mydata_tsibble %>%
  model(
    STL(value ~ trend(window = 7) +
                   season(window = "periodic"),
    robust = TRUE)) %>%
  components() %>%
  autoplot() +
    theme(plot.title = element_text(face = "bold")) 
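
The classical decomposition described above can be sketched by hand in base R. This is an illustration on a made-up series, not the STL procedure that `feasts` runs; the names `dow`, `x`, `effects`, and so on are chosen here for the example.

```r
set.seed(1)
# Hypothetical stand-in series: 8 weeks, days 1-5 around 50, days 6-7 around 10
dow <- rep(1:7, times = 8)                       # position within the week
x   <- ifelse(dow >= 6, 10, 50) + rnorm(length(dow), sd = 2)

# Trend-cycle: centred 7-day moving average (NA for the first and last 3 days)
trend <- as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 2))

# Seasonal component: mean of the de-trended series for each day of the week,
# shifted so that the seven effects sum to zero
detrended <- x - trend
effects   <- tapply(detrended, dow, mean, na.rm = TRUE)
effects   <- effects - mean(effects)
seasonal  <- as.numeric(effects[dow])

# Remainder: whatever the trend and seasonal components do not explain
remainder <- x - trend - seasonal

round(effects, 1)
```

Because the 7-day window always contains exactly one of each weekday, the weekend dip ends up in `seasonal` rather than in `trend`, which is the behaviour the question is after.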


If you want the moving average, you can use the slider package, and you can smooth it further by taking a moving average of the moving average. As far as I know, this doesn't correspond exactly to the way loess performs its local regressions, but it may be more intuitive. With a suitably chosen moving average, informed by the decomposition, you could get a similar line.

mydata_tsibble_ma <- mydata_tsibble %>%
      arrange(date) %>%
      mutate(
        `7-MA` = slider::slide_dbl(value, mean,
                    .before = 3, .after = 3, .complete = TRUE),
        `2x7-MA` = slider::slide_dbl(`7-MA`, mean,
                    .before = 1, .after = 0, .complete = TRUE))

mydata_tsibble_ma %>%
  autoplot(value, colour = "gray") +
  geom_line(aes(y = `2x7-MA`), colour = "#D55E00") 
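
Composing the two `slide_dbl()` calls above amounts to a single weighted moving average. The check below is a sketch on made-up data; it uses `stats::filter()` instead of slider to stay dependency-free, and `x`, `w`, and `direct` are names chosen here for illustration.

```r
set.seed(1)
x <- rnorm(30, mean = 50, sd = 10)   # made-up daily values
n <- length(x)

# Step 1: centred 7-day mean, like .before = 3, .after = 3
ma7 <- stats::filter(x, rep(1 / 7, 7), sides = 2)

# Step 2: mean of the current and previous 7-MA values, like .before = 1, .after = 0
ma2x7 <- as.numeric(stats::filter(ma7, c(0.5, 0.5), sides = 1))

# The same result as one weighted window over days t - 4 ... t + 3
w <- c(1 / 14, rep(1 / 7, 6), 1 / 14)
direct <- sapply(seq_len(n), function(t) {
  idx <- (t - 4):(t + 3)
  if (idx[1] < 1 || idx[8] > n) NA_real_ else sum(w * x[idx])
})

all.equal(ma2x7, direct)
```

So the 2x7-MA here spans 8 days and is centred half a day before `date`. Because 7 is odd, the plain 7-MA is already centred and already averages over exactly one of each weekday; the second pass mainly adds a little extra smoothing.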
  


hachiko