
I am using multivariate GAM models to learn more about fog trends in multiple regions. Fog is defined as visibility dropping below a certain threshold (< 400 meters). Our GAM model is used to estimate the response of visibility to a range of meteorological variables.

However, my challenge right now is that I'd really like the y-axis to be the actual visibility observations rather than the centered smooth. It is interesting to see how visibility is affected by the covariates relative to the mean visibility at that location, but it's difficult to compare this across multiple locations where the mean visibility differs (so the zero point at which visibility is enhanced or diminished has little comparable meaning).

In order to compare the results across locations, I'm trying to make the y-axis the actual visibility observations, and then I'll draw a line at the visibility threshold we're interested in (400 m) to evaluate what the predictor variable values are like below that threshold (e.g. what temperatures are associated with visibility below 400 m).

I'm still a beginner when it comes to GAMs and R in general, but I've figured out a few helpful pieces so far.

Helpful things so far:

Attempt 1: how to extract the GAM fit for each variable in the model: Extracting data used to make a smooth plot in mgcv

Attempt 2: how to use the predict function to reconstruct a univariable model: http://zevross.com/blog/2014/09/15/recreate-the-gam-partial-regression-smooth-plots-from-r-package-mgcv-with-a-little-style/

Attempt 3: how to get some semblance of a y-axis that looks like visibility observations using "fitted" -- though I don't think this is the correct approach since I'm not taking the intercept into account: http://gsp.humboldt.edu/OLM/R/05_03_GAM.html

simulated data

install.packages("mgcv") #for gam package
require(mgcv)
install.packages("pspline")
require(pspline)


# simulated GAM data for the example; gamSim(eg = 1) returns y, x0, x1, x2, ...
dataSet <- gamSim(eg = 1, n = 400, dist = "normal", scale = 2)
visibility  <- dataSet[[1]]  # y,  standing in for visibility
temperature <- dataSet[[2]]  # x0, standing in for temperature
dewpoint    <- dataSet[[3]]  # x1, standing in for dew point
windspeed   <- dataSet[[4]]  # x2, standing in for wind speed


#Univariable GAM model
gamobj <- gam(visibility ~  s(dewpoint))
plot(gamobj, scale=0, page=1, shade = TRUE, all.terms=TRUE, cex.axis=1.5, cex.lab=1.5, main="Univariable Model: Dew Point")
summary(gamobj)
AIC(gamobj)
abline(h=0)

Univariable Model of Dew Point https://i.stack.imgur.com/Q5jYj.jpg

ATTEMPT 2 -- predict function with univariable model, but didn't change y-axis

# dummy covariate spanning the range of the original covariate
maxDP <- max(dewpoint)
minDP <- min(dewpoint)
DPtrial.seq <- seq(minDP, maxDP, length = 3071)
DPtrial.seq <- data.frame(dewpoint = DPtrial.seq)

#predict only the DP term 
preds <- predict(gamobj, type="terms", newdata=DPtrial.seq, se.fit=TRUE)

#determine confidence intervals
DPplot <- DPtrial.seq$dewpoint
fit <- preds$fit
fit.up95  <- fit + 1.96*preds$se.fit  # upper 95% bound
fit.low95 <- fit - 1.96*preds$se.fit  # lower 95% bound

#plot
plot(DPplot, fit, lwd=3,
 main="Reconstructed Dew Point Covariate Plot")

#plot confidence intervals
polygon(c(DPplot, rev(DPplot)), 
    c(fit.low95,rev(fit.up95)), col="grey",
    border=NA)

lines(DPplot, fit,  lwd=2)
rug(dewpoint) 

Reconstructed Dew Point Covariate Plot https://i.stack.imgur.com/YtLqE.jpg

ATTEMPT 3 -- changed y-axis using "fitted" but without taking intercept into account

plot(dewpoint,fitted(gamobj), main="Fitted Response of Y (Visibility) Plotted Against Dew Point")
abline(h=mean(visibility))
rug(dewpoint)

Fitted Response of Y Plotted Against Dew Point https://i.stack.imgur.com/cQ3v5.jpg

Ultimately, I want a horizontal line where I can investigate the predictor variable relative to 400 meters, rather than just the mean of the response variable. This way, it will be comparable across multiple sites where the mean visibility is different. Most importantly, it needs to work for multiple covariates!

Gavin Simpson has explained the method in a couple of posts, but unfortunately I don't understand how to hold the other covariates constant at their means while I use the predict function:

Changing the Y axis of default plot.gam graphs

Any deeper explanation of the method for doing this would be super helpful!

  • I've tried to provide an answer, but this entails you changing your entire approach (I think you should do a multiple regression-like model using `gam()` not separate univariate models). But assuming that you want to go that way, then I have explained how to hold some covariates constant whilst you predict giving a focus on one or two variables. – Gavin Simpson May 28 '19 at 23:19
  • If you want to stick with a single univariate model at a time, then I'm not seeing why `predict(model, newdata, type = "response", se.fit = TRUE)` doesn't do what you want? (If you are actually fitting a non-Gaussian model, then you want `type = "link"` and then back-transform the fitted values and confidence interval as I show in my answer). – Gavin Simpson May 28 '19 at 23:19
  • Hi Gavin, thanks so much for responding! My intention is definitely to use the multiple regression-like model using gam( ) with all the variables, not separate ones. I just couldn't find any examples online for doing multi. Excited to try your suggestions now and see if I can figure it out. – Q.James May 29 '19 at 00:00

1 Answer


I'm not sure how helpful this will be as your Q is a little more open-ended than we'd typically like on SO, but here goes.

Firstly, I think it would help to consider how you model the response variable, which I assume is currently visibility. This is going to be a continuous variable, bounded at 0 (perhaps the data never reach zero?), which suggests modelling the data as conditionally distributed either

  • gamma (family = Gamma(link = 'log')) for visibility that never takes a value of zero.
  • Tweedie (family = tw()) for data that do have zeroes.

An alternative approach would be to model the occurrence of fog; if this is defined as an event <400m visibility then you could turn all your observations into 0/1 values for being a fog event or otherwise. Then you'd model the data as conditionally distributed Bernoulli, using family = binomial().
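
As a rough sketch (not part of the original answer), assuming your data are in a data frame df with a visibility column and using the 400 m threshold from the question, these three options translate into gam() calls along these lines:

library(mgcv)

## strictly positive visibility: conditional Gamma
m1 <- gam(visibility ~ s(dewpoint), data = df,
          family = Gamma(link = "log"), method = "REML")

## visibility containing exact zeroes: Tweedie
m2 <- gam(visibility ~ s(dewpoint), data = df,
          family = tw(), method = "REML")

## fog occurrence instead: 1 if visibility < 400 m, 0 otherwise
df$fog <- as.integer(df$visibility < 400)
m3 <- gam(fog ~ s(dewpoint), data = df,
          family = binomial(), method = "REML")

A single smooth of dewpoint is used here only to keep the sketch short; the next paragraph explains why you'd want all the predictors in one model.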

Having decided on a modelling approach, we need to model the response. This should be done using a multiple regression type of approach, with a GAM including multiple predictors. This way you get to estimate the effect of each potential predictor variable on the response while controlling for the effects of the other predictors. If you just do this using a single predictor at a time, say dewpoint, that variable could well "explain" variation in the data that might be due to another predictor, windspeed say, and you wouldn't know it.
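
As a hedged sketch of what that multi-predictor model might look like with the variable names from the question (the Gamma family is just one of the options above):

## all candidate predictors in one model, so each smooth is estimated
## while controlling for the others
m <- gam(visibility ~ s(dewpoint) + s(temperature) + s(windspeed),
         data = df, family = Gamma(link = "log"), method = "REML")
summary(m)  # approximate significance and EDF of each smooth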

Furthermore, there may well be interactions between predictors that you'll want to account for if they exist, which can only be done in a single model that contains all the predictors.
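
For example, if you suspected a dewpoint-windspeed interaction, one way to include it (a sketch, not prescribed by the original answer) is a tensor product interaction term via ti(), which adds the pure interaction on top of the main-effect smooths:

m_int <- gam(visibility ~ s(dewpoint) + s(windspeed) + s(temperature) +
               ti(dewpoint, windspeed),
             data = df, family = Gamma(link = "log"), method = "REML")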

Then, to finally get to the crux of your problem: having fitted the multi-predictor model to "explain" visibility, you will need to predict from the model for sets of likely conditions. To look at how visibility varies with dewpoint in a model where other predictor variables have effects, you need to fix the other variables at some reasonable values; one option is to set them to their mean (or modal value in the case of any factor predictor variables), or some other value indicative of typical conditions for that variable. You'll have to use your domain knowledge for this.

If you have interactions in the model, then you'll need to vary the two variables in the interaction, whilst holding all other variable fixed at some values.

Let's assume you don't have interactions and are interested in dewpoint but the model also includes windspeed. The mean windspeed for the values used to fit the model can be found from the cmX component of the fitted model. Or you could just calculate this from the observed windspeed values, or set it to some known number you want to use. Denote the fitted model by m and the data frame holding your data by df; then we can create new data to predict at over the range of dewpoint, whilst holding windspeed fixed.

mn.windspd <- m$cmX['windspeed']
## or
mn.windspd <- with(df, mean(windspeed))
## or set it to some known value
mn.windspd <- 10 # say

Then you can do

preddata <- with(df,
                 expand.grid(dewpoint = seq(min(dewpoint),
                                            max(dewpoint),
                                            length = 300),
                             windspeed = mn.windspd))

Then you use this to predict from the fitted model:

pred <- predict(m, newdata = preddata, type = "link", se.fit = TRUE)
pred <- as.data.frame(pred)

Now we want to put these predictions back on to the response scale, and we want a confidence interval so we have to create that first before back transforming:

ilink <- family(m)$linkinv
pred <- transform(pred,
                  Fitted = ilink(fit),
                  Upper  = ilink(fit + (2 * se.fit)),
                  Lower  = ilink(fit - (2 * se.fit)),
                  dewpoint = preddata$dewpoint)

Now you can visualise the effect of dewpoint on the response whilst keeping windspeed fixed.
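
To connect this back to the question, here is a minimal base-graphics sketch (my own, using the pred object created above) that plots the fitted visibility with its interval and marks the 400 m fog threshold:

with(pred, {
  plot(dewpoint, Fitted, type = "n",
       xlab = "Dew point", ylab = "Visibility (m)",
       ylim = range(Lower, Upper))
  polygon(c(dewpoint, rev(dewpoint)), c(Lower, rev(Upper)),
          col = "grey", border = NA)   # confidence interval
  lines(dewpoint, Fitted, lwd = 2)     # fitted visibility
  abline(h = 400, lty = 2)             # fog threshold from the question
})
rug(df$dewpoint)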

In your case, you will have to extend this to keeping temperature constant also, but that is done in the same way:

mn.windspd <- m$cmX['windspeed']
mn.temp <- m$cmX['temperature']
preddata <- with(df,
                 expand.grid(dewpoint = seq(min(dewpoint),
                                            max(dewpoint),
                                            length = 300),
                             windspeed = mn.windspd,
                             temperature = mn.temp))

and then follow the steps above to do the prediction.

For one or two variables varying I have a function data_slice() in my gratia package which will do the above expand.grid() stuff for you so you don't have to specify the mean values of the other covariates:

preddata <- data_slice(m, 'dewpoint', n = 300)

Technically, this finds the value in the data closest to the median value (for the covariates not varying). If you want means, then do:

fixdf <- data.frame(windspeed = mn.windspd, temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', data = fixdf, n = 300)

If you have an interaction, say between dewpoint and windspeed then you need to vary two variables. This is pretty easy again with expand.grid():

mn.temp <- m$cmX['temperature']
preddata <- with(df,
                 expand.grid(dewpoint = seq(min(dewpoint),
                                            max(dewpoint),
                                            length = 100),
                             windspeed = seq(min(windspeed),
                                             max(windspeed),
                                              length = 100),
                             temperature = mn.temp))

This will create a 100 x 100 grid of values of the covariates to predict at, whilst holding temperature constant.
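
If you want to look at that surface on the response scale, one possible sketch (again not part of the original answer) is to predict over the grid, back-transform, and draw it with image()/contour(), assuming m here is a model that actually contains the dewpoint-windspeed interaction:

pred2 <- predict(m, newdata = preddata, type = "link", se.fit = TRUE)
ilink <- family(m)$linkinv
preddata$Fitted <- ilink(pred2$fit)

## expand.grid() varies dewpoint fastest, so fill the matrix by column
dp <- unique(preddata$dewpoint)
ws <- unique(preddata$windspeed)
z  <- matrix(preddata$Fitted, nrow = length(dp), ncol = length(ws))

image(dp, ws, z, xlab = "Dew point", ylab = "Wind speed")
contour(dp, ws, z, add = TRUE)
contour(dp, ws, z, levels = 400, lwd = 2, add = TRUE)  # 400 m fog threshold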

For data_slice() you'd need to do:

fixdf <- data.frame(temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', 'windspeed',
                       data = fixdf, n = 300)

Extending this to more covariates that you want to vary is also easy following this pattern with expand.grid(); I have yet to implement more than two varying variables in data_slice().

Gavin Simpson
  • Thanks so much for the help! This worked great! Here is what I have so far: https://imgur.com/gJOdJSZ -- this is using the log family transformation and then converting it back with exp(). Hopefully that's correct! I really appreciate you explaining everything so well and I'm excited to continue using this as a tool. – Q.James Jun 03 '19 at 20:53
  • A couple follow up Qs -- 1) how would I extend this for categorical variables? I typically also add precipitation in the model, where 1 is a day with precipitation and 0 is a day without. However, it seems like holding that constant with the median would be inaccurate. – Q.James Jun 03 '19 at 20:57
  • Also ... 2) How do you determine whether you should be using interaction terms? Is it just based on previous knowledge about the system? 3) Why did you choose n=300 for making the matrix. My model is n=1764 -- would I not want the values used for the prediction to be close to the same size? – Q.James Jun 03 '19 at 20:59
  • Hi Gavin, when I try to include a categorical variable as a factor, the results of pred make a fit for each level. What is the best way to include a categorical variable, but hold some sort of average for the factor constant? – Q.James Jun 14 '19 at 00:12
  • Sorry - missed this. Re 1) you'd do this with `rain` set to 1 and again set to 0 so you have the two states, or you can just do it for the reference category (so the combination of reference levels over all factors). – Gavin Simpson Jun 14 '19 at 20:13
  • Re 2) looking at the model residuals and plotting them against each covariate should help show if interactions are needed, or you may know that there is an interaction from domain knowledge or the literature. 3) I chose 300 for the univariate case as that will give a reasonably smooth evaluation of the fitted smoother. It should have been 100 for both in the bivariate example as you are then predicting over a 100x100 grid; you only need to go higher if you have very complex smoothers using large EDFs, and the number here is just to get a smooth plot. – Gavin Simpson Jun 14 '19 at 20:15
  • With categorical variables, you need to hold them at some fixed value, either the reference level or for specific levels of the factor that you want to visualise. You have to choose the levels of each factor and the combinations of factors that you want to evaluate at. As you get more and more factors, it is easier to just work with the partial effects plots that `plot.gam()` produces of the smooths. – Gavin Simpson Jun 14 '19 at 20:17
  • Thank you for all your follow ups! I have learned so much from your response, other posts on stack, and your posts on fromthebottomoftheheap. I really appreciate it. – Q.James Jun 17 '19 at 00:52