
Edit: following interactions in the responses below, I believe there may be some issues with the plot() or plot.gam() functions when dealing with gam outputs. See responses below.


I am running a non-parametric regression model <- gam(y ~ s(x, bs = "cs"), data = data).

My data looks as follows (x is in logs); I have 273 observations:

          y         x
[1,] 0.010234756 10.87952
[2,] 0.009165001 10.98407
[3,] 0.001330975 11.26850
[4,] 0.008000957 10.97803
[5,] 0.008579472 10.94924
[6,] 0.009746714 11.01823

I would like to plot the output of the model, basically the fitted curve. When I do

# graph
plot(model)

or

ggplot(data = data, mapping = aes(x = x, y = y)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method="gam", formula= y~s(x, bs = "cs") )

I get the desired output graphs (apologies for the original labels):

[plot 1: output of plot(model)]

[plot 2: output of the ggplot call]

However, the two plotted curves are not exactly the same, and I did not manage to find which parameters to tweak to remove the differences. Therefore I would like to plot the curve manually. Here is my current attempt.

model <- gam(y ~ s(x, bs = "cs"), data = data)
names(model)
# summary(model)
model_fit <- as.data.frame(cbind(model$y, model$fitted.values, 
                                   model$linear.predictors, data$x, 
                                   model$residuals))
names(model_fit) <- c("y", "y_fit", "linear_pred", "x", "res")


### here the plotting
ggplot(model_fit) +
  geom_point(aes(x = x, y = y_fit), size = 0.5, alpha = 0.5) +
  geom_line(aes(x = x, y = y_fit))
  

However, I get the following warning:

geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?

and the wrong output graph:

[incorrect output plot]

I do not seem to be able to fix the last graph (the error seems to be in geom_point()) or add the confidence intervals, nor to find which parameters to tweak to make the first two plots exactly the same.

Bob
    For the original comparison you could likely just change your `n` in `stat_smooth` - see https://stackoverflow.com/questions/33040344/stat-smooth-gam-not-the-same-as-gam-mgcv – B Williams Apr 13 '21 at 16:25
    I doubt a difference between 80 and 100 evaluation points is responsible for these differences. It's more likely due to the difference in the algorithm used to select smoothness parameters; the OP used the default GCV but stat_smooth uses REML, which is preferred. – Gavin Simpson Apr 14 '21 at 05:53
  • The REML method solved the curve slope, but I believe there is an issue with `plot()` and `plot.gam()` . I made some tests and posted the outcome here https://stackoverflow.com/a/67096214/2291642 . I'd appreciate a more knowledgeable take at it. Thanks. – Bob Apr 14 '21 at 17:17

2 Answers


The difference is likely due to using different fitting algorithms. The default in gam() is (currently) method = "GCV.Cp", even though the recommended option is method = "REML". stat_smooth() uses method = "REML". GCV-based smoothness selection is known to undersmooth in some circumstances, and that seems to be the case here: the REML solution is a much smoother curve.

If you change to method = "REML" in your gam() call, the differences should disappear.
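To make that concrete, here is a sketch (assuming your data frame `data` from the question) of the two calls fitted with the same smoothness-selection method so they should agree:

```r
library(mgcv)
library(ggplot2)

# Fit with REML, matching what stat_smooth(method = "gam") uses internally
model <- gam(y ~ s(x, bs = "cs"), data = data, method = "REML")

# The geom_smooth() curve should now coincide with plot(model)
ggplot(data, aes(x = x, y = y)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))
```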

That said, you really shouldn't be ripping things out of model objects like that. For a start, $residuals is not what you think it is: those are the working residuals from the P-IRLS algorithm, which are not useful in this context. Use the extractor functions like fitted(), residuals(), etc.
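For example, a safer version of your manual data frame would be (a sketch, assuming the `model` and `data` from the question and that `data` has no missing rows, so rows align with the fit):

```r
# Use extractor functions rather than pulling components out by hand
model_fit <- data.frame(
  x     = data$x,
  y     = data$y,
  y_fit = fitted(model),    # fitted values on the response scale
  res   = residuals(model)  # deviance residuals by default for a GAM
)
```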

The easiest way to plot your own version of that drawn by plot.gam() is to capture the object returned by plot.gam() and then use that object to draw what you need.

Via plot.gam()

library(mgcv)
library(gratia)  # for data_sim()

df <- data_sim("eg1", seed = 2)
m <- gam(y ~ s(x2), data = df, method = "REML")
p_obj <- plot(m, residuals = TRUE)
p_obj <- p_obj[[1]] # just one smooth so select the first component
sm_df <- as.data.frame(p_obj[c("x", "se", "fit")])
data_df <- as.data.frame(p_obj[c("raw", "p.resid")])

## plot
ggplot(sm_df, aes(x = x, y = fit)) +
  geom_rug(data = data_df, mapping = aes(x = raw, y = NULL),
           sides = "b") +
  geom_point(data = data_df, mapping = aes(x = raw, y = p.resid)) +
  geom_ribbon(aes(ymin = fit - se, ymax = fit + se, y = NULL),
              alpha = 0.3) +
  geom_line() +
  labs(x = p_obj$xlab, y = p_obj$ylab)

Which produces

[plot: the smooth of x2 with confidence ribbon, partial residuals, and rug]

Alternatively, you might look at my {gratia} package or the {mgcViz} package of Matteo Fasiolo as options that will do this all for you.

{gratia} example

For example with {gratia}

library('gratia')
draw(m, residuals = TRUE)

which produces

[plot: gratia::draw() version of the same smooth]

Gavin Simpson
  • Thank you. I will take a look at the {gratia} package. However, without considering the residuals, should not the fitted values be enough to plot the curve? I still do not get why the last graph is not working as expected. Do you have any advice? Thanks – Bob Apr 14 '21 at 15:24
  • I made some tests and posted the outcome here https://stackoverflow.com/a/67096214/2291642 . I'd appreciate a more knowledgeable take at it. Thanks. – Bob Apr 14 '21 at 17:17
  • @Bob I have included code to show how to get what you want from `plot.gam()` or via my {gratia} package – Gavin Simpson Apr 14 '21 at 21:37

The solution provided by @Gavin Simpson here partially solves the issue: to make the two curves equal, one needs to add method = "REML" to the gam() call. The two curves then have the same slope.

However, for some reason, when plotting the output of a gam() model using either plot() or plot.gam(), the curve does not fit the original data as it should. The same happens when plotting manually from the elements of the object returned by plot.gam(). I am not sure why this happens. In my case, the fitted curve is shifted downwards, clearly "missing" the data points it is supposed to fit. Below are the code and the corresponding output graph; the latter is the same as what plot() or plot.gam() produces, with the original data points added.

b <- plot(model_1)  # plot.gam() invisibly returns its plotting data
# or: b <- plot.gam(model_1)

data.plot <- as.data.frame(cbind(b[[1]]$x, b[[1]]$fit, b[[1]]$se))
names(data.plot) <- c("x", "fit", "se")
ggplot(data = data.plot, mapping = aes(x = x, y = fit)) +
  geom_line() +
  geom_line(aes(y = fit + se), linetype = "dashed") +
  geom_line(aes(y = fit - se), linetype = "dashed") +
  geom_point(data = df_abs, aes(x = log(prd_l_1999), y = prd_gr),
             size = 0.5, alpha = 0.5)

[misplaced graphs: the fitted curve sits below the data points]

Note that the ggplot() call renders the plot properly. Therefore, my uninformed guess is that this may be an issue with the plotting method.
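One way to check this (a sketch, under the assumption that the vertical shift is simply the model intercept, which plot.gam() leaves out of the partial effect) is to add the intercept back onto the extracted smooth before plotting:

```r
b <- plot(model_1)              # capture the invisibly returned plotting data
intercept <- coef(model_1)[1]   # "(Intercept)" term of the fitted model

shifted <- data.frame(x   = b[[1]]$x,
                      fit = drop(b[[1]]$fit) + intercept,
                      se  = drop(b[[1]]$se))

ggplot(shifted, aes(x = x, y = fit)) +
  geom_line() +
  geom_point(data = df_abs, aes(x = log(prd_l_1999), y = prd_gr),
             size = 0.5, alpha = 0.5)
```

If the curve now passes through the data, the "shift" was just the centering of the smooth, not a plotting bug.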

Working solution

I am not able to prove that the issue is with the plotting functions, but it turns out that this is the same issue as in this question, and the partial solution provided by that OP fixes the plotting while still using the gam() function. Below is his code adapted to my case, with the corresponding output graph. As you can see, the graph is plotted properly and the curve fits the data as it is supposed to. This may corroborate my hypothesis, even though I cannot prove it, as I am not knowledgeable enough.

library(mgcv)
library(data.table)

model_1 <- gam(prd_gr ~ s(log(prd_l_1999), bs = "cs"), data = df_abs, method = "REML")


preds <- predict(model_1,se.fit=TRUE)
my_data <- data.frame(mu   = preds$fit,
                      low  = preds$fit - 1.96 * preds$se.fit,
                      high = preds$fit + 1.96 * preds$se.fit)

ggplot()+
  geom_line(data = my_data, aes(x=log(df_abs$prd_l_1999), y=mu), size=1, col="blue")+
  geom_smooth(data=my_data,aes(ymin = low, ymax = high, x=log(df_abs$prd_l_1999), y = mu), stat = "identity", col="green")

[plot: fitted curve with confidence band, overlaid on the data]

Bob
  • Note that `plot.gam()` is producing a partial effect plot and as such it doesn't necessarily go through the data; it is only showing the effect of s(x), which is centred about the data average. What `plot.gam()` shows as data are actually partial residuals, not the data. `predict()` returns the actual fitted response (so b0 + s(x) ) and hence it should go near the data. This is not a partial effect plot so you are producing two different types of plot. Both are correct, just used for different things, and you can't add the data to the partial effect plot as they don't need to overlap – Gavin Simpson Apr 14 '21 at 18:47
  • When you say **partial effect**, do you mean that **b0** is left out, as any other additional explanatory variable one may add other than **x** ? I see, thanks. – Bob Apr 14 '21 at 21:15
    For `plot.gam()` yes, the intercept is left off - that's why the plots are centred around 0 - as are any other terms in the model. When you `predict()` you include the intercept plus the effects of any other terms in the model. – Gavin Simpson Apr 14 '21 at 21:18