10

I am using the mgcv package to model the ozone pollution concentration according to some environmental covariates. The model takes the form :

model1 <- gam(O3 ~ s(X, Y, bs = "tp", k = 10) + wd + s(date, bs = "cc", k = 100) + district,
              data = mydata, family = gaussian(link ="log"),
              na.action = "na.omit", method = "REML")

And here is the structure of covariates:

> str(mydata)
'data.frame': 7100 obs. of  286 variables:
 $ date            : Date, format: "2016-01-01" "2016-01-01" "2016-01-01" ...
 $ O3              : num  0.0141 0.0149 0.0102 0.0159 0.0186 ...
 $ district        : Factor w/ 10 levels "bc","bh","dl",..: 1 8 7 8 2 6 4 4 10 2 ...
 $ wd              : Factor w/ 16 levels "E","ENE","ESE",..: 13 13 13 13 13 2 9 9 11 13 ...
 $ X               : num  0.389 0.365 1 0.44 0.892 ...
 $ Y               : num  0.311 0.204 0.426 0.223 0.162 ...

I am stuck on an

error in R: 'names' attribute [1] must be the same length as the vector [0].

I try to find where the problem is by delete the term of s(date, bs = "cc", k = 100) from the fomular and it could work well. It seems like there is something wrong with date field.

I'm not exactly sure how to fix this problem. Any advice would be greatly appreciated!

Z.Lin
  • 28,055
  • 6
  • 54
  • 94
王嘉炜
  • 101
  • 1
  • 1
  • 3
  • looks like you will need to convert the date class to numeric. (ps; I have most often seen the cyclic spline used with months / weeks etc rather than raw dates) – user20650 Jan 04 '19 at 08:55
  • 2
    The answers given are sufficient, but I'll add that I had the same error when using a character variable that I was treating as a factor, but had not formally defined it as such. The problem was resolved after transforming it "mutate(var1 = as.factor(var.1))" – Kodiakflds Apr 09 '21 at 15:55
  • Linking to [this Stack Exchange answer](https://stats.stackexchange.com/a/515250)In case helpful for people like me who forget that all character variables need to be converted to factors. – Corey N. Aug 25 '23 at 16:23

1 Answers1

11

The date variable won't be automatically converted to a numeric variable; you need to do this yourself. I normally process such information as follows

mydata <- transform(mydata, ndate = as.numeric(date),
                    nyear  = as.numeric(format(date, '%Y')),
                    nmonth = as.numeric(format(date, '%m')),
                    doy    = as.numeric(format(date, '%j')))

Then I can choose to model the time component in a number of ways:

  1. trend based on ndate of nyear with a non-cyclic spline, or
  2. cyclic pattern based on nmonth or doy (for day of year), or
  3. a combination of trend and cyclic pattern

It's unclear from your question if your data are restricted to a single year. If the data span multiple years then you can't just use the cyclic spline on the ndate variable. You will need either a very complex standard spline (option 1) or include two splines, one for the between year part and one for the within year part (option 3.)

If your data is over multiple years then I would set the model up as

O3 ~ s(X, Y, bs = "tp", k = 10) + wd + s(doy, bs = 'cc', k = 20) +
     s(ndate, bs = "tp", k = 50) + district

or perhaps s(nyear, .... ) will be sufficient instead of s(ndate, .... ).

This kind of decomposition of the time component is useful as you can often do a better job of fitting the series via two simple, well-estimated smooths than a single more complex smooth. It also allows you to test for within and between year effects.

If you need the seasonal cycle to vary with the trend, then a tensor product is helpful:

O3 ~ s(X, Y, bs = "tp", k = 10) + wd +
     te(doy, ndate, bs = c('cc','tp'), k = c(20,50)) + district

For cyclic splines you may also want to set the knots argument, especially if your data don't quite span the full range of days of year etc. For doy I would use knots = list(doy = c(0.5, 366.5)) as this allows Dec 31st and Jan 1st to have slightly different estimated values. For nmonth this is more important as otherwise Dec and Jan would get the same fitted value. I use: knots = list(nmonth = c(0.5, 12.5)).

The idea here is that 1 and 12 reflect the middle of the respective month and 0.5 and 12.5 the beginning and end of the first and last months, which we might expect to be the same.

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453