mgcv: How to do stepwise regression with a Tweedie response model?

Question

Does anyone have an idea how to do stepwise regression with Tweedie in R?

I found the mgcv package, which apparently treats the power parameter of Tweedie as yet another parameter to be estimated. This seems to improve on having to use tweedie.profile to estimate the power outside the glm, so it seems encouraging for using an automated stepwise function to do the regression. But I haven't been able to figure out if the package also offers a stepwise function. The package manual has this to say.

I got lost in the talk about smooths:

There is no step.gam in package mgcv.
To facilitate fully automatic model selection the package implements two smooth modification techniques which can be used to allow smooths to be shrunk to zero as part of smoothness selection.

I would appreciate your help. Thanks.

Hi Zheyuan. Yes, very helpful, thanks very much. I need to sit down and understand how this smoothing works. Quick question: does the Tweedie function of mgcv also feature Select = TRUE and Method = RELM ? Thanks again — Alp Can, May 20 '16 at 19:37
I see that you can either use family=Tweedie in gam, or you can use the Tweedie function. Not sure about their difference yet. That's why I asked if select and method could be used in Tweedie as well. But what you said makes sense, thanks. — Alp Can, May 20 '16 at 21:04

Zheyuan Li · Accepted Answer · 2016-07-05T01:05:18.813

Your question is not specific to "Tweedie" family; it is a general mgcv feature in model selection.

mgcv does not use step.gam for model selection. I think your confusion comes from another package gam, which would use step.gam to sequentially add/drop a term and reports AIC. When you go ?step.gam in mgcv, it refers you to ?gam.selection. ?step.gam is intentionally left there, in case people search it. But all the details are provided in ?gam.selection.

There is no need to do step.gam in mgcv. Model estimation and model selection are integrated in mgcv. For a penalized regression/smoothing spline, when smoothing parameter goes to infinity (very large), its second derivative is penalized to zero, leaving a simple linear term. For example, if we specify a model like:

y ~ s(x1, bs = 'cr') + s(x2, bs = 'cr')

while s(x2) is a spurious model term and should not be included in the model, then mgcv:::gam/bam will shrink s(x2) to x2 after estimation, resulting a model like:

y ~ s(x1) + x2

This means, when you use plot.gam() to inspect the estimated smooth function for each model term, s(x1) is a curve, but s(x2) is a straight line.

Now this is not entirely satisfying. For a complete, successful model selection, we want to drop x2 as well, i.e., shrink s(x2) to 0, to get notationally a model:

y ~ s(x1)

But this is not difficult to achieve. We can use shrinkage smooth class bs = 'ts' (shrinkage thin plate regression spline, as opposed to the ordinary tp) or bs = cs' (shrinkage cubic regression spline, as opposed to the ordinary 'cr'), and mgcv:::gam/bam should be able to shrink s(x2) to 0. The math behind this, is that mgcv will modify the eigen values of linear term (i.e., the null space) from 0, to 0.1, a small but positive number, so that penalization takes effect on linear term. As a result, when you do plot.gam(), you will see s(x2) is a horizontal line at 0.

bs = 'cs' or bs = 'ts' are supposed to be put in function s(); yet mgcv also allows you to leave bs = 'cr' or bs = 'tp' untouched in s(), but put select = TRUE in gam() or bam(). The select = TRUE is a more general treatment, as shrinkage smooths at the moment only have class cs and ts, while select = TRUE work for all kind of smooth specification. They essentially do the same thing, by increasing 0 eigen values to 0.1.

The following example is taken from the example under ?gam.selection. Note how select = TRUE shrinks several terms to 0, giving an informative model selection.

library(mgcv)
set.seed(3);n<-200
dat <- gamSim(1,n=n,scale=.15,dist="poisson") ## simulate data
dat$x4 <- runif(n, 0, 1);dat$x5 <- runif(n, 0, 1) ## spurious
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5),data=dat,
        family=poisson,select=TRUE,method="REML")
summary(b)
plot.gam(b,pages=1)

Note that, the p-values in summary.gam() also gives evidence for such selection:

Approximate significance of smooth terms:
            edf Ref.df  Chi.sq p-value    
s(x0) 1.7655119      9   5.264  0.0397 *  
s(x1) 1.9271039      9  65.356  <2e-16 ***
s(x2) 6.1351372      9 156.204  <2e-16 ***
s(x3) 0.0002618      9   0.000  0.4088    
s(x4) 0.0002766      9   0.000  1.0000    
s(x5) 0.1757146      9   0.195  0.2963    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.545   Deviance explained = 51.6%
-REML = 430.78  Scale est. = 1         n = 200

Zheyuan. Good answer. May I ask for some thoughts please: in your example the coefficients of x3, x4, (and possible x5??) have been shrunk towards zero, so this is similar to a variable selection?. Would you refit the model without these? (as in my case, if many variables in a model are shrunk to zero then presenting all these zero coefficients in a model would, I think be problematic with a reviewer. — user2957945, Jul 11 '16 at 19:04

mgcv: How to do stepwise regression with a Tweedie response model?

1 Answers1