I used stat_smooth() in R for the first time, and I want to know if there is a way to get, for each x, the distance between the data (y) and the predicted interval, as you can see in the picture here:
Thank you for your precious help!
As was pointed out in the comments above, clarification of your goals would be helpful.
If you want to replicate what ggplot2 does and find the distances for points outside the interval, I have some code for you.
First I create some sample data and plot it:
library(ggplot2)
# sample data
set.seed(1234)
x <- c(1:100)
y <- c(1:100) + rnorm(100, sd = 5)
df <- data.frame(x, y)
ggplot(df, aes(x, y)) + geom_point(alpha = .4) + stat_smooth(span = .3)
Then I replicate what ggplot2 does: I build a loess model (ggplot2 chooses loess if n < 1000), which I then use to build the confidence intervals in the same way stat_smooth does. Note: the parameters of the model need to match the parameters you used in stat_smooth.
# find model, matching the span parameter from the graph above
model <- loess(y ~ x, data = df, span = 0.3)
# find x sequence
xseq <- sort(unique(df$x))
# function adapted from ggplot2::predictdf.loess:
# https://github.com/hadley/ggplot2/blob/f3b519aa90907f13f5d649ff6a512fd539f18b2b/R/stat-smooth-methods.r#L45
predict_loess <- function(model, xseq, level = 0.95) {
  pred <- stats::predict(model, newdata = data.frame(x = xseq), se = TRUE)
  y_pred <- pred$fit
  # half-width of the t-based confidence interval
  ci <- pred$se.fit * stats::qt(level / 2 + .5, pred$df)
  ymin <- y_pred - ci
  ymax <- y_pred + ci
  data.frame(x = xseq, y_pred, ymin, ymax, se = pred$se.fit)
}
# predict your data
predicted_data <- predict_loess(model, xseq, level = 0.95)
# merge predicted data with original y
merged_data <- with(df, cbind(predicted_data, y))
head(merged_data)
# x y_pred ymin ymax se y
# 1 1 -0.5929504 -5.8628535 4.676953 2.652067 -5.035329
# 2 2 0.2828659 -4.1520646 4.717796 2.231869 3.387146
# 3 3 1.1796057 -2.5623056 4.921517 1.883109 8.422206
# 4 4 2.1074914 -1.0994171 5.314400 1.613870 -7.728489
# 5 5 3.0696584 0.2371895 5.902127 1.425434 7.145623
# 6 6 4.0568034 1.4454944 6.668113 1.314136 8.530279
From the replicated data we can now find the distances. For cases inside the interval, the distance is 0.
distances <- with(merged_data, ifelse(y < ymin, ymin - y,
ifelse(y > ymax, y - ymax, 0)))
head(distances)
# [1] 0.000000 0.000000 3.500689 6.629071 1.243496 1.862167
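If the nested ifelse() is hard to read, the same distance can probably be written more compactly with pmax(). The sketch below uses a small made-up data frame (the name toy and its values are hypothetical, not the data from above) and checks that both versions agree:

```r
# Hypothetical toy values, only for illustration
toy <- data.frame(
  y    = c(-5, 3, 8, 2),
  ymin = c(-6, 4, 1, 2),
  ymax = c( 4, 6, 5, 9)
)

# Distance to the nearest interval bound; 0 when y lies inside [ymin, ymax]
toy$distance <- with(toy, pmax(ymin - y, y - ymax, 0))

# Nested ifelse() version, for comparison
toy$distance_ifelse <- with(toy, ifelse(y < ymin, ymin - y,
                                 ifelse(y > ymax, y - ymax, 0)))

toy$distance
# [1] 0 1 3 0
```

At most one of ymin - y and y - ymax can be positive for a given point, so taking the element-wise maximum together with 0 gives the distance to whichever bound was crossed.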
It's not a very elegant solution, but it could point you in the right direction.
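As a cross-check, you could also read the ribbon that ggplot2 itself computed instead of rebuilding the model. This is only a sketch: it assumes the smoother is the second layer of the plot (after geom_point) and that layer_data() returns x, ymin and ymax columns for it. Because the smoother may be evaluated on its own grid of x values, the bounds are interpolated at the observed x with approx().

```r
library(ggplot2)

# same sample data as above
set.seed(1234)
df <- data.frame(x = 1:100, y = 1:100 + rnorm(100, sd = 5))

p <- ggplot(df, aes(x, y)) +
  geom_point(alpha = .4) +
  stat_smooth(method = "loess", formula = y ~ x, span = .3)

# computed data of layer 2 (assumed to be the smoother; layer 1 is geom_point)
smooth_data <- layer_data(p, 2)

# interpolate the ribbon bounds at the observed x values
ymin_at_x <- approx(smooth_data$x, smooth_data$ymin, xout = df$x)$y
ymax_at_x <- approx(smooth_data$x, smooth_data$ymax, xout = df$x)$y

# distance to the ribbon; 0 for points inside it
distances_from_plot <- pmax(ymin_at_x - df$y, df$y - ymax_at_x, 0)
head(distances_from_plot)
```

If the smoother happens to be evaluated exactly at the data's x values, the interpolation changes nothing and the result should agree with the manual replication above, up to numerical differences.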