With the following code:

library(ggplot2)
ggplot(mtcars, aes(x=wt, y=mpg)) +
    geom_point(aes(colour=factor(cyl))) +
    geom_smooth(method="lm")

I can get this plot:

[Plot: mpg against wt, points coloured by factor(cyl), with a fitted line and a grey band around it]

My question is: how is the grey zone defined, and what does it mean? And how can I play around with the parameters that control the width of that band?

hrbrmstr
neversaint

2 Answers

By default, it is the 95% confidence interval for the fitted values (the mean predictions) of a linear model ("lm"). The documentation from ?geom_smooth states that:

The default stat for this geom is stat_smooth see that documentation for more options to control the underlying statistical transformation.

Digging one level deeper, doc from ?stat_smooth tells us about the methods used to calculate the smoother's area.

For quick results, one can play with the level argument of stat_smooth: the level of confidence interval to use (0.95 by default).

Any such parameter given to geom_smooth is passed on to stat_smooth, so if you wish to have a narrower region, you could use, for instance, 0.90 as the confidence level:

ggplot(mtcars, aes(x=wt, y=mpg)) +
    geom_point(aes(colour=factor(cyl))) +
    geom_smooth(method="lm", level=0.90)

[Plot: the same scatter plot with a narrower grey band at level=0.90]
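To see where the band comes from, it can be rebuilt by hand. The following is a minimal sketch, assuming the band is the pointwise confidence interval that predict() reports for the same lm fit; the names fit, grid, band95, band90, width95 and width90 are made up for illustration and are not part of ggplot2.

```r
# Fit the same straight line that geom_smooth(method = "lm") draws
fit <- lm(mpg ~ wt, data = mtcars)

# Evenly spaced wt values over the observed range (80 points chosen arbitrarily)
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 80))

# interval = "confidence" returns columns fit, lwr and upr:
# the fitted mean and the lower/upper limits of the band at each wt
band95 <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
band90 <- predict(fit, newdata = grid, interval = "confidence", level = 0.90)

# Width of the band (upper minus lower) at each grid point
width95 <- band95[, "upr"] - band95[, "lwr"]
width90 <- band90[, "upr"] - band90[, "lwr"]

# Is the 90% band narrower than the 95% band everywhere?
print(all(width90 < width95))

# Where is the band narrowest? Compare with the mean of wt
print(grid$wt[which.min(width95)])
print(mean(mtcars$wt))
```

The band is tightest near the mean of wt and flares out towards the ends of the data, which is the shape of the grey zone in the plots.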

Dominic Comtois
  • Thanks. What does the confidence interval (CI) tell us here? How did you choose the 'ideal' level for the CI? – neversaint Apr 11 '15 at 02:27
  • There's no "ideal" level, only more or less conservative (prudent) ones. For what it tells us, I'd suggest looking into `?predict` and `?predict.lm`. Basically, it indicates the range in which the true regression line plausibly lies, given the variability of the data. One sample leads to a single straight line of predictions; a different sample would give a slightly different line, and the zone reflects that uncertainty. By setting level at .9, we say "if we were to repeat the sampling over and over and build the grey zone each time, about 90% of those zones would contain the true regression value at any given x". – Dominic Comtois Apr 11 '15 at 02:51
  • Is it possible to show something other than se? For example, the 10th and 90th quantiles of the data? – Simon Woodward Aug 22 '17 at 02:55
  • Why is it narrower the lower the chosen level is? – Ben Dec 10 '18 at 09:19
  • @Ben, the band is narrower the lower the confidence level because the more one restricts the band, the higher the chance that the fit was a fluke and that the real regression curve falls outside it. – gciriani Nov 22 '19 at 19:53
  • @Ben It's always a trade-off between precision and certitude (or confidence). If you want to be 99% confident of capturing the population value (parameter), then your estimation needs to accommodate quite a bit of departure from the estimate obtained with your current sample. Using a low confidence level = getting more precision at the cost of a high risk of missing the target. – Dominic Comtois Nov 16 '21 at 00:42
  • @SimonWoodward Maybe look into [quantile regression](https://ggplot2.tidyverse.org/reference/geom_quantile.html) – Dominic Comtois Nov 16 '21 at 00:45
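For the quantile question raised in the comments, a rough sketch with geom_quantile might look like the following. It assumes the quantreg package is installed (geom_quantile depends on it), and the choice of 0.1 and 0.9 simply mirrors the 10th and 90th quantiles asked about. Note that this draws two fitted quantile lines, not a shaded band.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(aes(colour = factor(cyl)))

# geom_quantile() needs the quantreg package; skip the layer if it is missing
if (requireNamespace("quantreg", quietly = TRUE)) {
    # Fitted 10th and 90th percentile lines of mpg as a linear function of wt
    p <- p + geom_quantile(quantiles = c(0.1, 0.9), formula = y ~ x)
}

print(p)
```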

It's the confidence interval. You can use se=FALSE if you do not want to display it. You can also use level = 0.99 if you want to have a 99% CI instead of a 95% CI. See ?stat_smooth for all the details.
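As a small illustration of both options, using the same mtcars plot as in the question (the object names base, p_line_only and p_wide are only for this sketch):

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(aes(colour = factor(cyl)))

# No band at all: only the fitted line is drawn
p_line_only <- base + geom_smooth(method = "lm", se = FALSE)

# Wider band: 99% confidence level instead of the default 95%
p_wide <- base + geom_smooth(method = "lm", level = 0.99)

print(p_line_only)
print(p_wide)
```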

shadow