11

I would like to display a plot created by geom_smooth() but it is important for me to be able to describe how the plot was created.

I can see from the documentation when n >= 1000, gam is used as the smoothing function, but I cannot see how many knots are used or what function generated the smoothing.

Example:

library(ggplot2)

set.seed(12345)
n <- 3000
x1 <- seq(0, 4*pi,, n)
x2 <- runif(n)
x3 <- rnorm(n)
lp <- 2*sin(2* x1)+3*x2 + 3*x3
p <- 1/(1+exp(-lp))
y <- ifelse(p > 0.5, 1, 0)

df <- data.frame(x1, x2, x3, y)

# default plot
ggplot(df, aes(x = x1, y = y)) +
  geom_smooth() 

# specify method='gam'
# linear
ggplot(df, aes(x = x1, y = y)) +
  geom_smooth(method = 'gam') 

# specify gam and splines
# Shows non-linearity, but different from default
ggplot(df, aes(x = x1, y = y)) +
  geom_smooth(method = 'gam',
              method.args = list(family = "binomial"),
              formula = y ~ splines::ns(x, 7)) 

If I want to use the default parameters, is there a way to identify the function used to create the smoothing so I can accurately describe it in a methods section of the analysis?

variations of geom_smooth

Z.Lin
  • 28,055
  • 6
  • 54
  • 94
A Toll
  • 647
  • 7
  • 15
  • 1
    possible answer here : http://stackoverflow.com/questions/9789871/method-to-extract-stat-smooth-line-fit – Nate Nov 28 '16 at 22:11
  • Or maybe this one: http://stackoverflow.com/questions/15584541/how-to-extract-fitted-splines-from-a-gam-mgcvgam#15587786 since it appears from ggplot2 documentation that it uses mgcv – HFBrowning Nov 28 '16 at 22:23
  • 2
    The default formula (per the documentation) for "gam" is supposed to be `formula = y ~ s(x, bs = "cs")`, and so `k` was set to whatever the default is from `mgcv::s`. However, it looks like in ggplot2_2.2.0 "gam" uses `formula = y ~ x` by default, so fits a straight line instead of any kind of spline. I have doubts if this is what it is supposed to do; the resulting plot is now the same as what you'd get using "lm". – aosmith Nov 28 '16 at 22:27
  • @aosmith I'm on 2.2.0, but `ggplot(diamonds, aes(price, carat)) + geom_smooth()` is definitely not straight.. – Axeman Nov 28 '16 at 22:51
  • @Axeman But look what happens when you manually specify `method = "gam"` instead of letting ggplot pick "gam" because of the >1000 points. Seems odd. – aosmith Nov 28 '16 at 22:59
  • @aosmith Huh. You wanna file that, or should I? – Axeman Nov 28 '16 at 23:00
  • 1
    @Axeman You can go ahead. Who knows, maybe it's always been this way and the user should specify the formula if choosing to use gam. – aosmith Nov 28 '16 at 23:01
  • 1
    @aosmith, idk, I downgraded to v1.0.0 and it was already the case.. – Axeman Nov 28 '16 at 23:13
  • For the record, Hadley addressed this [here](https://github.com/tidyverse/ggplot2/issues/1931): "This behaviour (default formula does not change from `y ~ x` to `y ~ s(x, bs = "cs")` if `method = "gam"` is stated explicitly) is **correct**; we report the formula but do not automatically change it. That's up to you." – Z.Lin Mar 07 '19 at 08:34

1 Answers1

6

I wrote a function to reverse-engineer the steps used in StatSmooth's setup_params function to get the actual method / formula parameters used for plotting.

The function expects a ggplot object as its input, with an additional optional parameter specifying the layer that corresponds to geom_smooth (defaults to 1 if unspecified). It returns a text string in the form "Method: [method used], Formula: [formula used]", and also prints out all the parameters to console.

The envisaged use case is two-fold:

  1. Add the text string as-is to the plot as plot title / subtitle / caption, for quick reference during analysis;
  2. Read off the console printout, & include the information elsewhere or manually format it nicely (e.g. parsed plotmath expressions) for annotation in the plot, for report / presentation.

Function:

get.params <- function(plot, layer = 1){

  # return empty string if the specified geom layer doesn't use stat = "smooth"
  if(!"StatSmooth" %in% class(plot$layers[[layer]]$stat)){
    message("No smoothing function was used in this geom layer.")
    return("")
  }

  # recreate data used by this layer, in the format expected by StatSmooth
  # (this code chunk takes heavy reference from ggplot2:::ggplot_build.ggplot)
  layer.data <- plot$layers[[layer]]$layer_data(plot$data)
  layout <- ggplot2:::create_layout(plot$facet, plot$coordinates)
  data <- layout$setup(list(layer.data), plot$data, plot$plot_env)
  data[[1]] <- plot$layers[[layer]]$compute_aesthetics(data[[1]], plot)
  scales <- plot$scales
  data[[1]] <- ggplot2:::scales_transform_df(scales = scales, df = data[[1]])
  layout$train_position(data, scales$get_scales("x"), scales$get_scales("y"))
  data <- layout$map_position(data)[[1]]

  # set up stat params (e.g. replace "auto" with actual method / formula)
  stat.params <- suppressMessages(
    plot$layers[[layer]]$stat$setup_params(data = data, 
                                           params = plot$layers[[layer]]$stat_params)
    )

  # reverse the last step in setup_params; we don't need the actual function
  # for mgcv::gam, just the name
  if(identical(stat.params$method, mgcv::gam)) stat.params$method <- "gam"

  print(stat.params)

  return(paste0("Method: ", stat.params$method, ", Formula: ", deparse(stat.params$formula)))
}

Demonstration:

p <- ggplot(df, aes(x = x1, y = y)) # df is the sample dataset in the question

# default plot for 1000+ observations
# (method defaults to gam & formula to 'y ~ s(x, bs = "cs")')
p1 <- p + geom_smooth()
p1 + ggtitle(get.params(p1))

# specify method = 'gam'
# (formula defaults to `y ~ x`)
p2 <- p + geom_smooth(method='gam')
p2 + ggtitle(get.params(p2))

# specify method = 'gam' and splines for formula
p3 <- p + geom_smooth(method='gam',
              method.args = list(family = "binomial"),
              formula = y ~ splines::ns(x, 7))
p3 + ggtitle(get.params(p3))

# specify method = 'glm'
# (formula defaults to `y ~ x`)
p4 <- p + geom_smooth(method='glm')
p4 + ggtitle(get.params(p4))

# default plot for fewer observations
# (method defaults to loess & formula to `y ~ x`)
# observe that function is able to distinguish between plot data 
# & data actually used by the layer
p5 <- p + geom_smooth(data = . %>% slice(1:500))
p5 + ggtitle(get.params(p5))

plot

Z.Lin
  • 28,055
  • 6
  • 54
  • 94