0

I began to have this question while reading this book "R for Data Science" by Hadley Wickham. In the chapter of "Data Transformation", the author used this example:

library(nycflights13)
library(tidyverse)
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
delay <- filter(delay, count > 20, dest != "HNL")

# It looks like delays increase with distance up to ~750 miles 
# and then decrease. Maybe as flights get longer there's more 
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The output plot looks like as follows: relationship between delay and flight distance

So I wonder if we can know the extremum of the curve and the corresponding x value using R?

From this answer, I've figured using stat_poly_eq() from ggpmisc to calculate the polynomial regression equation:

library(ggpmisc)    
formula=y ~ poly(x, 3, raw=TRUE)
p <- ggplot(data = delay, mapping = aes(x = dist, y = delay)) 
p <- p + geom_point(aes(size = count), alpha = 1/3) 
p <- p+ geom_smooth(method = "lm", formula = formula, se = FALSE)
(p1 <- p+ stat_poly_eq(formula = formula, aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE))

Polynomial Equation

And use that equation to find out the extremum of the curve, which I think it's very not clever. So I wonder how can I figure out the extremum and the corresponding x value of my regression curve (not the max and min value of raw data but the max and the min value of the regression curve).

LazyMark2
  • 25
  • 3
  • don't have time to answer right now, but you can get there by running `ggplot_build()` on your `ggplot` object (assuming you have saved it to a variable), extracting the `$data` component, figuring out which layer corresponds to the data generated by `geom_smooth()`, and then looking for the max value of the `y` variable in that data frame. (This will only be the maximum y value in the set of predicted points from the smooth, not the absolute maximum ...) – Ben Bolker Jun 10 '21 at 19:24

1 Answers1

0

You can see in your first code snippet, your ggplot call outputs this message:

#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Looking at the documentation for geom_smooth and stat_smooth:

stats::loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs") with method = "REML".

So you could just call stats::loess on your data to fit the model implied by the geom_smooth curve, and then use the predict method to get values outside of your actual range.

henryn
  • 1,163
  • 4
  • 15