
I used stat_smooth() in R for the first time, and I want to know whether there is a way to get, for each x, the distance between the data (y) and the predicted interval, as shown in the picture here:

[image not preserved]

Thank you for your help!

Jaap
Ezay
  • Could you share your data and the code you used to generate the plot? You can post the data using `dput(data)`. – Thomas K Sep 25 '15 at 15:35
  • What should happen with data points that lie inside the interval? Should `distance == 0` in those cases? – Thomas K Sep 25 '15 at 16:04
  • Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to produce a [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). This will make it much easier for others to help you. – Jaap Sep 25 '15 at 16:05
  • You have not asked a proper question (as suggested above). But I also think you do not fully understand what your question is, statistically or programmatically. `stat_smooth` applies one of many smoothers, `stats::loess` by default. What you are asking about is the residual between the loess prediction and the data itself (or, more specifically, a transformation of the prediction: the upper or lower prediction interval). You would not do this via `ggplot2::stat_smooth` but via the smoothing method itself. – alexwhitworth Sep 25 '15 at 16:14
  • There are many parameters that go into this prediction interval: the method (loess, lm, glm, etc.), the confidence level, and so on. I think you need to start simpler. Also, it would be helpful if you elaborated on what you are trying to do (i.e., why do you want this distance?). – alexwhitworth Sep 25 '15 at 16:15
  • This is not to discourage you, merely to suggest that you need a better understanding of your problem. And, that if you want help, you need to better communicate what **exactly** you want. – alexwhitworth Sep 25 '15 at 16:17

1 Answer


As was pointed out in the comments above, clarification of your goals would be helpful.

If you want to replicate what ggplot2 does and find the distances for points outside of the interval, I have some code for you.

First I create some sample data and plot it:

library(ggplot2)
# sample data
set.seed(1234)
x <- 1:100
y <- x + rnorm(100, sd = 5)
df <- data.frame(x, y)

ggplot(df, aes(x, y)) + geom_point(alpha = .4) + stat_smooth(span = .3)

[Figure: plot_output, the scatter plot with the smoothed curve and its interval produced by the code above]

Then I replicate what ggplot2 does: I build a loess model (ggplot2 chooses loess if n < 1000), which I subsequently use to build the confidence intervals in the same way stat_smooth does. Note: Parameters of the model need to match the parameters you used in stat_smooth.

# find model, matching the span parameter from the graph above
model <- loess(y ~ x, data = df, span = 0.3)

# find x sequence
xseq <- sort(unique(df$x))

# function adapted from ggplot2::predictdf.loess:
# https://github.com/hadley/ggplot2/blob/f3b519aa90907f13f5d649ff6a512fd539f18b2b/R/stat-smooth-methods.r#L45
predict_loess <- function(model, xseq, level = 0.95) {
  pred <- stats::predict(model, newdata = data.frame(x = xseq), se = TRUE)

  y_pred <- pred$fit
  # half-width of the interval: standard error times the t quantile
  ci <- pred$se.fit * stats::qt(level / 2 + .5, pred$df)
  ymin <- y_pred - ci
  ymax <- y_pred + ci

  data.frame(x = xseq, y_pred, ymin, ymax, se = pred$se.fit)
}

# predict your data
predicted_data <- predict_loess(model, xseq, level = 0.95)

# merge predicted data with original y
# (this relies on df being sorted by x with exactly one row per x value)
merged_data <- cbind(predicted_data, y = df$y)

head(merged_data)
#   x     y_pred       ymin     ymax       se         y
# 1 1 -0.5929504 -5.8628535 4.676953 2.652067 -5.035329
# 2 2  0.2828659 -4.1520646 4.717796 2.231869  3.387146
# 3 3  1.1796057 -2.5623056 4.921517 1.883109  8.422206
# 4 4  2.1074914 -1.0994171 5.314400 1.613870 -7.728489
# 5 5  3.0696584  0.2371895 5.902127 1.425434  7.145623
# 6 6  4.0568034  1.4454944 6.668113 1.314136  8.530279
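The `cbind()` step above lines up correctly only because the sample data has exactly one row per x value and is already sorted by x. If your data has repeated or unsorted x values, one possible workaround is to predict at the observed x values directly, so that each row keeps its own fit and bounds. This is only a sketch under that assumption; `df2`, `model2`, `pred2`, `half_width` and `band2` are names made up for the illustration.

```r
# Assumed sample data: like the data above, but shuffled and with x = 50 repeated
set.seed(1234)
df2 <- data.frame(x = c(1:100, 50))
df2$y <- df2$x + rnorm(nrow(df2), sd = 5)
df2 <- df2[sample(nrow(df2)), ]

model2 <- loess(y ~ x, data = df2, span = 0.3)

# Predict at every observed x, row by row, so no separate merge step is needed
pred2 <- predict(model2, newdata = df2, se = TRUE)
half_width <- unname(pred2$se.fit) * qt(0.95 / 2 + 0.5, pred2$df)

band2 <- data.frame(
  x      = df2$x,
  y      = df2$y,
  y_pred = unname(pred2$fit),
  ymin   = unname(pred2$fit) - half_width,
  ymax   = unname(pred2$fit) + half_width
)
```

Each row of `band2` then holds a data point together with the interval bounds at its own x, whatever the row order.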

From the replicated data we can now find the distances. Points inside the interval get a distance of 0.

distances <- with(merged_data, ifelse(y < ymin, ymin - y,
                                      ifelse(y > ymax, y - ymax, 0)))
head(distances)
# [1] 0.000000 0.000000 3.500689 6.629071 1.243496 1.862167
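The nested `ifelse()` can also be written with `pmax()`, and a signed variant tells you on which side of the interval a point lies. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in for the `y`, `ymin` and `ymax` columns of `merged_data`) so that the arithmetic is easy to check by hand.

```r
# Hypothetical toy values standing in for the y, ymin and ymax columns
toy <- data.frame(
  y    = c(-5, 3, 8, 1),
  ymin = c(-4, 0, 2, 1),
  ymax = c( 2, 5, 6, 4)
)

# Unsigned distance: 0 inside the interval, otherwise the gap to the nearest bound
toy$distance <- with(toy, pmax(ymin - y, y - ymax, 0))

# Signed distance: negative below the interval, positive above it, 0 inside
toy$signed <- with(toy, pmin(y - ymin, 0) + pmax(y - ymax, 0))

toy$distance
# [1] 1 0 2 0
toy$signed
# [1] -1  0  2  0
```

The same two expressions should work unchanged inside `with(merged_data, ...)`.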

It's not a very elegant solution, but it could point you in the right direction.
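One caveat: the interval computed above is a confidence interval for the fitted curve, not a prediction interval for individual observations, which is why many points fall outside it. If what you want is closer to a prediction interval, a common rough approximation (an assumption on my part, not something stat_smooth draws) is to add the residual scale of the loess fit to the standard error. The names `dat`, `fit`, `t_mult`, `ci_half` and `pi_half` are made up for this sketch.

```r
# Same kind of sample data as above
set.seed(1234)
dat <- data.frame(x = 1:100)
dat$y <- dat$x + rnorm(100, sd = 5)

fit  <- loess(y ~ x, data = dat, span = 0.3)
pred <- predict(fit, newdata = dat, se = TRUE)

t_mult <- qt(0.95 / 2 + 0.5, pred$df)

# Half-width of the confidence interval for the curve (what the code above replicates)
ci_half <- t_mult * pred$se.fit

# Rough half-width of a prediction interval: also accounts for the residual noise
pi_half <- t_mult * sqrt(pred$se.fit^2 + pred$residual.scale^2)

# Share of points outside each interval
mean(abs(dat$y - pred$fit) > ci_half)
mean(abs(dat$y - pred$fit) > pi_half)
```

The second interval is always wider, so fewer points end up with a non-zero distance.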

Thomas K