8

I have had great success using gam to model seasonality for time series data. My latest model clearly shows a weekly pattern in addition to seasonal changes. While the weekly pattern itself is very stable over the year, its amplitude also varies with the seasons. So ideally I would like to model my data as:

y ~ f(day in year) + g(day in year) * h(day in week)

where f, g and h are cyclic smooth functions in mgcv

gam(
  y ~ s(day_in_year, k=52, bs='cc') 
  + s(day_in_year, k=52, bs='cc'):s(day_in_week, k=5, bs='cc')
  , knots=list(
    day_in_year=c(0, 356)
    , day_in_week=c(0,7)
  )
  , data = data
)

Unfortunately this doesn't work and throws back the error NA/NaN argument. I tried using te(day_in_year, day_in_week, k=c(52, 5), bs='cc') which works, but introduces too many degrees of freedom as the model overfits holidays which fall on specific weekdays within the short number of available years.

Is it possible to specify a model the way I am trying to do?

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
unique2
  • 2,162
  • 2
  • 18
  • 23
  • I am also surprised that this is not possible. It seems like a fairly common problem to have a spline `s(x1)` that you want to scale/move based on an other spline `s(x2)`, without changing the shape of either spline. If we could make `te()` create a rank 1 matrix, that would give the desired result right? – JohannesNE Nov 23 '21 at 09:09

3 Answers3

6

Wow, quite an old question!

Regarding interaction

While the weekly pattern itself is very stable over the year, its amplitude also varies with the seasons.

The use of tensor product spline basis te is the right way for interaction, although a more decent constructor is ti. You said that te returns you many parameters. Of course. You have k = 52 for the first margin, and k = 5 for the second, then you end up with 52 * 5 - 1 coefficients for this tensor term. But this is just the way an interaction is created.

Note that in mgcv GAM formula, : or * is only valid for interaction between parametric terms. Interaction between smooths are handled by te or ti.

If this is not what you hope for, then what do you expect the "product" to be? A Hadamard product of two marginal design matrices? Then what's the sense of that? By the way, a Hadamard product requires two matrices of the same dimension. However, your two margins do not have the same number of columns.

If you don't understand why I keep talking about matrices, then you need to read Simon's book in 2006. Though GAM estimation explained there is rather outdated now, the construction / setup of GAM (like design matrices and penalty matrices) explained in Chapter 4 does not change at all even after a decade.

OK , let me give you one more hint. The row wise Kronecker product used for construction of te / ti design matrix is not a new invention.

A smooth term s(x) is quite like a factor variable g, as though they appear to be a single variable, they are constructed to be a design matrix with many columns. For g it is a dummy matrix, while for f(x) it is a basis matrix. So, interaction between two smooth functions is constructed in the same way that interaction between two factors is constructed.

If you have a factor g1 of 5 levels, and another factor g2 of 10 levels, their marginal design matrix (after contrasts) have 4 columns and 9 columns, then the interaction g1:g2 would have 36 columns. Such design matrix, is just the row wise Kronecker product of the design matrix of g1 and g2.


Regarding overfitting

As you said, you only have few years of data, maybe 2 or 3? In that case, your model have been over-parametrized by using k = 52 for day_in_year. Try reduces it to for example k = 30.

If overfitting is still evident, here are a few ways to address it.

Way 1: You are using GCV for smoothness selection. Try method = "REML". GCV always inclines to overfit data.

Way 2: Staying with GCV, manually inflate smoothing parameters for a heavier penalization. gamma parameter of gam is useful here. Try gamma = 1.5 for example.


A typo in your code?

The knots location, should it be day_in_year = c(0, 365)?

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
  • The model you suggest doesn't have the form y ~ f(day in year) + g(day in year) * h(day in week) that I asked for in my question. It rather takes the form y ~ f(day in year) + g(day in year, day in week). With mgcv it is simply not possible to build such a model. – unique2 Jun 07 '17 at 18:43
6

Your model doesn't make much sense as you state that there is:

  1. a seasonal effect,
  2. a day of week effect,
  3. an interaction such that the day of week effect varies with the day of year

This can be fully represented as a tensor product smooth. The model you mention in the comment on the other answer here

y ~ f(day in year) + g(day in year) * h(day in week)

is just a decomposition of the full tensor product if you mean * as main effect + interaction. In which case the model as you have it isn't identifiable - you ave a function of day of year twice. And if you meant the equivalent of :, then your model doesn't have the main effect of day of week, which seems undesirable.

I fit models of this form all the time (just for year and day of year). I would approach this via:

gam(y ~ te(day_of_year, day_of_week, k = c(20, 6), bs = c("cc", "cc")),
    data = foo, method = "REML", knots = knots)

You could also tweak the knots definitions. I tend to use the following:

knots <- list(day_of_year = c(0.5, 366.5),
              day_of_week = c(0.5, 7.5)

This won't make much difference but you're just putting boundary knots just a little closer to the data.

If you want to decompose the effects, you could fit the model with ti() smooths

gam(y ~ ti(day_of_year, bs = "cc", k = 12) + ti(day_of_week, bs = "cc", k = 6) + 
      te(day_of_year, day_of_week, k = c(12, 6), bs = c("cc", "cc")),
    data = foo, method = "REML", knots = knots)

You can tweak the values of k to determine suitable values in conjunction with your data and gam.check().

You will also need to add a term to the model to handle holidays. This would be parametric term which applies an adjustment if the day is a holiday - so, create a factor holiday and add it to the model + holiday. You could think of more complex models; perhaps a factor indexing if a week has a holiday coupled with a by factor smooth for the day_of_week component, such that you estimate one weekly pattern if the week is a normal week and a second weekly pattern if the week contains a holiday.

If you show us an example/plot of the data I could expand or comment less generally.

It isn't surprising that the te() smooth you fitted doesn't adapt well to the holidays; the model assumes that the week effect is smooth and smoothly varying with a smooth day of year effect. Holidays are not smooth departures from the weekly pattern in the previous or subsequent week. The holiday effect is not well modelled by smooth relationships and something else is need to account for that effect.

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
0

My question is now a couple of years old and I want to add the solution which ended up working best for me.

First, it's impossible to fit the model as described in my question using mgcv.

For some time I used a two stage process.

model1 = gam(
  y ~ s(day_in_year, k=52, bs='cc') 
  + s(day_in_year, k=52, bs='cc')
  + as.factor(day_in_week)
  , knots=list(
    day_in_year=c(0, 366)
    , day_in_week=c(0,7)
  )
  , data = data
)
# get_weekday_offset gets the coefficients for each weekday and normalizes them to have mean 0
data$weekday_offset = get_weekday_offset(model1)[data$day_in_week]

model2 = gam(
  y ~ s(day_in_year, k=52, bs='cc') 
  + s(day_in_year, k=52, bs='cc')
  + s(day_in_year, k=52, bs='cc', by=weekday_offset)
  + as.factor(day_in_week)
  , knots=list(
    day_in_year=c(0, 366)
    , day_in_week=c(0,7)
  )
  , data = data
)

The second model has the form "y ~ f(day in year) + g(day in year) * h(day in week)", but h is a fixed function and not fitted to the data in the second model, but based on a best fit in model1. The obvious drawback of this is that mgcv needs to be run twice and almost all information (other than the coefficients for the weekdays) from the first run are thrown away.

The reason I finally abandoned this model was that on top of changes to the amplitude there were indeed some changes to the shape over the course of the year, so I finally settled on looked like this.

gam(
  y ~ s(day_in_year, k=52, bs='cc') 
  + s(day_in_year, k=10, bs='cc', by=as.factor(day_in_week)
  + as.factor(day_in_week)
  , knots=list(
    day_in_year=c(0, 366)
    , day_in_week=c(0,7)
  )
  , data = data
)

I dropped the s for the weekday, since the data I had justified using a maximum amount of degrees of freedom for the weekly pattern, and I added an interaction effect between the weekday and the day of the year (s(day_in_year, k=12, bs='cc', by=as.factor(day_in_week)), this estimates a separate seasonality curve for each weekday, however since it contains substantially lower number of degrees of freedom than the seasonality term s(day_in_year, k=52, bs='cc'), I don't run into the issues I had when trying te terms.

unique2
  • 2,162
  • 2
  • 18
  • 23