2

I'm looking for the most efficient way to identify/extract data points that fall outside the CI shade in a correlation plot like this one:

ggplot(df,aes(x,y))+geom_point()+
stat_smooth(method = "lm", formula = y~poly(x, 2), size = 1, se = T, level = 0.99)

sample plot

I would like to be able to save a new variable which marks the data points that fall outside as follows:

    x     y      group
1:  0.0  0.00     1
2:  0.5  0.40     1
3:  0.9  0.70     1
4:  1.0  1.30     1
5:  2.0  6.60     0
6:  3.0  3.10     1
7:  4.0  4.40     1
8:  5.0  5.90     1
9:  6.0  6.05     1
10: 7.0  7.60     1
11: 8.0  8.00     1
12: 9.0  2.90     0
13: 10.0 13.80    1
14: 11.0 13.40    1
15: 12.0 14.90    1

Original Data:

df <- data.table("x"=c(0,0.5,0.9,1,2,3,4,5,6,7,8,9,10,11,12), 
      "y"=c(0,0.4,0.7,1.3,6.6,3.1,4.4,5.9,6.05,7.6,8,2.9,13.8,13.4,14.9))

Desired Data:

df2 <- data.table("x"=c(0,0.5,0.9,1,2,3,4,5,6,7,8,9,10,11,12), 
       "y"=c(0,0.4,0.7,1.3,6.6,3.1,4.4,5.9,6.05,7.6,8,2.9,13.8,13.4,14.9), 
       "group" = c(1,1,1,1,0,1,1,1,1,1,1,0,1,1,1))
pyne
  • 507
  • 1
  • 5
  • 16
  • 1
    Possible duplicate of [Find points over and under the confidence interval when using geom\_stat / geom\_smooth in ggplot2](http://stackoverflow.com/questions/33082901/find-points-over-and-under-the-confidence-interval-when-using-geom-stat-geom-s) – neilfws May 01 '17 at 01:06
  • perfect! thanks @neilfws – pyne May 01 '17 at 01:14

2 Answers2

5

Not sure how you can do this using ggplot. But you can also rerun the lm regression and deduce the points outside the confidence interval from there.

df$group=rep(1,nrow(df))
lm1=lm(y~poly(x,2),df)
p1=predict(lm1,interval="confidence",level=0.99)
df$group[df$y<p1[,2] | df$y>p1[,3]]=0
Lamia
  • 3,845
  • 1
  • 12
  • 19
3

First we would run a linear model lm() on your data corresponding to your smoothed fit. x + I(x^2) is the same thing as poly(x, 2), just written out. Then we augment the original data with the predictions from that model, which will be columns named .fitted, .resid, .se.fit. Then we can make a new variable called group where it's a logical test: is the distance between the observed y and the predicted .fitted greater than 2.58 times the standard error of the fit? This roughly corresponds to your 99% confidence interval from the smoothed line.

require(broom)
require(dplyr)

df %>% 
  do(augment(lm(y ~ x + I(x^2), data = .))) %>%
  mutate(group = as.numeric(abs(y - .fitted) > 2.58*.se.fit))

For funsies, we can view your data, and just color the points differently by that group variable:

df %>% 
  do(augment(lm(y ~ x + I(x^2), data = .))) %>%
  mutate(group = as.numeric(abs(y - .fitted) < 2.58*.se.fit)) %>%
  ggplot(aes(x, y)) + geom_point(aes(colour = factor(group)), size = 4) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), size = 1, level = .99)

enter image description here


Edited to clarify

The question asked about the 99% CI. I mistakenly had "3" as the z-score to mark points outside the confidence interval. It is actually 2.58*.se.fit. For the 95% CI, it would be 1.96 (~2).

Brian
  • 7,900
  • 1
  • 27
  • 41
  • neat, though a little too advanced for me to fully understand the steps, but thank you! – pyne May 01 '17 at 01:30
  • Sorry, I wrote it a bit hastily. I'll edit it with comments. – Brian May 01 '17 at 01:44
  • @Brian Hi Brian, I could not understand in your answer the part about `the observed y and the predicted .fitted greater than 3 times the standard error of the fit? This roughly corresponds to your 99% confidence interval from the smoothed line.`. What is the `?` mean and How can we decide 95% confidence interval multiplier in this aspect ? – Alexander Aug 02 '17 at 15:53
  • The predicted value `.fitted` is the model's estimate of the mean at each value of `x`. The standard error of the mean is output from the model as `.se.fit`. For normally distributed variables, the 95% confidence interval is mean +/- 2.96*SE. If you look at the value `abs(y-.fitted)` that tells you how far away the data was from the model fit. If that distance is `< 3*.se.fit` then that datapoint is within the 95% CI. – Brian Aug 02 '17 at 16:02
  • 1
    @Brian Thanks for the answer. I am little bit lost in here. I look at the data and checked `abs(y-.fitted)` . If I put 2.96*SE to group one of the data in the top right side is outside of the CI 95%. I don't know you mis-typed 2.96 or not? because in [here] (https://en.wikipedia.org/wiki/Confidence_interval) in the table it is saying the number should be 1.96? I still cannot understand where this 3 came from in your solution? – Alexander Aug 02 '17 at 23:01
  • 1
    The 2.96 was from me mixing up the 95% and 99% confidence intervals. The z-score for 99% is 2.58 and for 95% is 1.96. Thanks for catching my error! – Brian Aug 02 '17 at 23:32
  • @Brian No problem. Thanks for the correction and your answer to this question. It is very useful for me:) – Alexander Aug 03 '17 at 17:10