0

I am trying to predict, on my own, the loess values provided by ggplot geom_smooth(). I have attached links to my data and the output plot of the predictions.

Data can be found here. I followed an example provided from this post about loess prediction to reproduce the values from ggplot, so I think I am on the right track, but I am missing something.

library("ggplot2")
load(file="data5a.RData")
lsmod = loess(Flux~DA_SQ_KM, data=data5a, control=loess.control(surface="direct"))
xrange <- max(data5a$DA_SQ_KM,na.rm=TRUE)
xseq <- c(0.01,0.05,0.1,0.2,0.3,0.5,seq(from=1, to=xrange, length=100))
pred = predict(lsmod,newdata=data.frame(DA_SQ_KM = xseq), se=TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)

ggplot(data5a, aes(DA_SQ_KM, Flux)) + 
  geom_point()+
  geom_smooth(method="loess")+
  geom_smooth(aes_auto(loess.DF), data=loess.DF, stat="identity",col="red")+
  geom_smooth(method="lm",se=FALSE,col="green")+
  theme(legend.position = "bottom")+
  scale_y_log10()+
  scale_x_log10()

Where is the error in my code for reproducing the data in the blue curve that is predicted by geom_smooth()?

Here is an image of the output within ggplot:

Output ggplot image

UPDATE 1:

I have included updated code here based on input provided by Roland. I have modified my code to use the mgcv::gam function since my data contains greater than 1000 points. The issue still remains that I cannot reproduce the model created by geom_smooth within ggplot. A new issue has also emerged with the confidence intervals.

library("ggplot2")
library("mgcv")
load(file="data5a.RData")

#Attempt to re-create the gam model myself
gammod = mgcv::gam(Flux~s(DA_SQ_KM, bs = "cs"),data=data5a)

xrange <- max(data5a$DA_SQ_KM,na.rm=TRUE)
xseq <- c(0.001,0.01,0.05,0.1,0.2,0.3,0.5,seq(from=1, to=xrange, length=100))

pred = predict(gammod ,newdata=data.frame(DA_SQ_KM = xseq), se=TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
gam.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)


ggplot(data5a, aes(DA_SQ_KM, Flux)) + 
  geom_point()+
  geom_smooth(aes_auto(gam.DF), data=gam.DF, stat="identity",col="red")+
  stat_smooth(method=mgcv::gam,formula = y ~ s(x, bs = "cs"),se=TRUE,col="purple")+
  theme(legend.position = "bottom")+
  scale_y_log10()+
  scale_x_log10()

Here is the gam output within ggplot:

Output2 ggplot image

MNewco
  • 13
  • 4
  • It looks like you might have more than 1000 observations. In such a case, `stat_smooth` does not use `loess` but `mgcv::gam`. – Roland Jan 08 '19 at 06:43
  • Roland, thank you for pointing that out. I have modified my code accordingly and included an update to my post. The issue still remains however that I cannot seem to reproduce the predictions from ggplot `geom_smooth` with my own gam model. – MNewco Jan 08 '19 at 18:48
  • Please provide a reproducible example. – Roland Jan 08 '19 at 19:19
  • 1
    Within my post I have include the link to the dataset [link](https://www.dropbox.com/s/do4nrkqqjc01na1/data5a.RData?dl=0) and it should be completely reproducible including the creation of the figure. – MNewco Jan 08 '19 at 20:36
  • Sorry, I don't download data from dropbox. – Roland Jan 09 '19 at 07:14

1 Answers1

0

ggplot2 fits the model to the transformed variables if you use scale_* transformations:

DF <- data.frame(x = 1:3, y = c(10, 100, 1e3))

library(ggplot2)
p <- ggplot(DF, aes(x, y)) +
  geom_point() +
  scale_y_log10() +
  stat_smooth(method = "lm", n = 3)
g <- ggplot_build(p)
g[["data"]][[2]]
#  x y ymin ymax se PANEL group  colour   fill size linetype weight alpha
#1 1 1    1    1  0     1    -1 #3366FF grey60    1        1      1   0.4
#2 2 2    2    2  0     1    -1 #3366FF grey60    1        1      1   0.4
#3 3 3    3    3  0     1    -1 #3366FF grey60    1        1      1   0.4

Note the zero SEs, which indicate a perfect fit.

log10(predict(lm(y ~ x, data = DF)))
#  1        2        3 
#NaN 2.568202 2.937016 

predict(lm(log10(y) ~ x, data = DF))
#1 2 3 
#1 2 3 
Roland
  • 127,288
  • 10
  • 191
  • 288
  • Roland, thanks. I see the problem now. I thought `scale_*` simply transformed my axes and not my data. – MNewco Jan 11 '19 at 20:19