I have a scatter plot. How can I find the genes above and below the confidence interval lines?


EDIT: Reproducible example:

library(ggplot2)
#dummy data
df <- mtcars[,c("mpg","cyl")]

#plot
ggplot(df,aes(mpg,cyl)) +
  geom_point() +
  geom_smooth()

LyzandeR
star
  • You can start by including your code and data. – nrussell Oct 12 '15 at 13:50
  • `identify(x, y...)`, but part of your data is needed – Mateusz1981 Oct 12 '15 at 14:04
  • Note that the confidence interval lines are a confidence interval for the mean of the data, not for the data itself. And because you have so much data, I'd expect most of the values to be outside of the interval. – bramtayl Oct 12 '15 at 15:00
  • I want to select the points lower than Q1 and higher than Q3. Which variable should I filter my data on (cv or mean expression)? Or is there another solution for it? – star Oct 14 '15 at 08:44
  • Dear all, after running your commands I drew the scatter plot for outer=1 and unfortunately got the same result as my first scatter plot. How can I extract the outlier points? – star Oct 20 '15 at 16:29
  • I think I should change "the confidence interval is created as pred$fit + pred$se.fit * qt(0.95 / 2 + .5, pred$df) and the lower boundary as pred$fit - pred$se.fit * qt(0.95 / 2 + .5, pred$df)", but I do not know how. – star Oct 20 '15 at 17:43

3 Answers

I had to take a deep dive into the GitHub repo, but I finally got it. To do this you need to know how stat_smooth works. In this specific case the loess function is called to do the smoothing (the other smoothing functions can be handled with the same process as below).

So, using loess on this occasion we would do:

#data
df <- mtcars[,c("mpg","cyl")]
#run loess model
cars.lo <- loess(cyl ~ mpg, df)

Then I had to read the source in order to see how the predictions are made internally in stat_smooth. Apparently Hadley uses the predictdf function (which is not exported from the namespace) as follows for our case:

predictdf.loess <- function(model, xseq, se, level) {
  pred <- stats::predict(model, newdata = data.frame(x = xseq), se = se)

  if (se) {
    y = pred$fit
    ci <- pred$se.fit * stats::qt(level / 2 + .5, pred$df)
    ymin = y - ci
    ymax = y + ci
    data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
  } else {
    data.frame(x = xseq, y = as.vector(pred))
  }
}

After reading the above I was able to create my own data.frame of the predictions using:

#get the predictions i.e. the fit and se.fit vectors
pred <- predict(cars.lo, se=TRUE)
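#create a data.frame from those
#note: the se.fit column below stores the CI half-width (se.fit times the t quantile), not the raw standard error
df2 <- data.frame(mpg=df$mpg, fit=pred$fit, se.fit=pred$se.fit * qt(0.95 / 2 + .5, pred$df))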

Looking at predictdf.loess we can see that the upper boundary of the confidence interval is created as pred$fit + pred$se.fit * qt(0.95 / 2 + .5, pred$df) and the lower boundary as pred$fit - pred$se.fit * qt(0.95 / 2 + .5, pred$df).

Using those we can create a flag for the points over or below those boundaries:

#make the flag
outerpoints <- +(df$cyl > df2$fit + df2$se.fit |  df$cyl < df2$fit - df2$se.fit)
#add flag to original data frame
df$outer <- outerpoints

The df$outer column is probably what the OP is looking for (it is 1 if the point is outside the boundaries and 0 otherwise), but just for the sake of it I plot it below.

Note that the unary + above is only used to convert the logical flag into a numeric.
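
Since the question asks for the points above and below the band separately, the same steps can be wrapped into a small helper. This is only a sketch: classify_vs_loess_ci is a hypothetical name (it is not part of ggplot2), it assumes the default loess settings, and its level argument is the confidence level that is hard-coded as 0.95 above.

```r
# Hypothetical helper, not part of ggplot2: label each point as above,
# below or inside the loess confidence band, using base R only.
classify_vs_loess_ci <- function(x, y, level = 0.95) {
  fit  <- stats::loess(y ~ x)              # same kind of model as cars.lo above
  pred <- stats::predict(fit, se = TRUE)   # fitted values and standard errors
  # half-width of the interval, as in predictdf.loess
  half_width <- pred$se.fit * stats::qt(level / 2 + 0.5, pred$df)
  ifelse(y > pred$fit + half_width, "above",
         ifelse(y < pred$fit - half_width, "below", "inside"))
}

df <- mtcars[, c("mpg", "cyl")]
df$position <- classify_vs_loess_ci(df$mpg, df$cyl)

# rows lying outside the band, with the side they fall on
outside <- df[df$position != "inside", ]
```

Passing a different level (for example level = 0.99) widens or narrows the band before the points are classified.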

Now if we plot it like this:

ggplot(df,aes(mpg,cyl)) +
  geom_point(aes(colour=factor(outer))) +
  geom_smooth() 

We can actually see the points inside and outside the confidence interval.

P.S. For anyone interested in the upper and lower boundaries, they can be drawn like this (speculation: the shaded area is probably created with geom_ribbon, or something similar, which makes it look rounder and prettier):

#upper boundary
ggplot(df,aes(mpg,cyl)) +
   geom_point(aes(colour=factor(outer))) +
   geom_smooth() +
   geom_line(data=df2, aes(mpg , fit + se.fit , group=1), colour='red')

#lower boundary
ggplot(df,aes(mpg,cyl)) +
   geom_point(aes(colour=factor(outer))) +
   geom_smooth() +
   geom_line(data=df2, aes(mpg , fit - se.fit , group=1), colour='red')
Jaap
LyzandeR
This solution takes advantage of the hard work ggplot2 does for you:

library(sp)

# we have to build the plot first so ggplot can do the calculations
ggplot(df,aes(mpg,cyl)) +
  geom_point() +
  geom_smooth() -> gg

# do the calculations
gb <- ggplot_build(gg)

# get the CI data
p <- gb$data[[2]]

# make a polygon out of it
poly <- data.frame(
  x=c(p$x[1],    p$x,    p$x[length(p$x)],    rev(p$x)), 
  y=c(p$ymax[1], p$ymin, p$ymax[length(p$x)], rev(p$ymax))
)

# test for original values in said polygon and add that to orig data
# so we can color by it
df$in_ci <- point.in.polygon(df$mpg, df$cyl, poly$x, poly$y)

# re-do the plot with the new data
ggplot(df,aes(mpg,cyl)) +
  geom_point(aes(color=factor(in_ci))) +
  geom_smooth()

It needs a bit of tweaking (i.e. that last point getting a value of 2), but I'm limited on time. Note that the point.in.polygon return values are:

  • 0: point is strictly exterior to pol
  • 1: point is strictly interior to pol
  • 2: point lies on the relative interior of an edge of pol
  • 3: point is a vertex of pol

so it should be easy to just change the code to TRUE/FALSE whether value is 0 or not.
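
A minimal sketch of that conversion, using a made-up vector of codes instead of a real point.in.polygon result so that it runs without sp installed:

```r
# Illustrative codes only (made up for this sketch), one for each possible
# return value of sp::point.in.polygon
codes <- c(0L, 1L, 2L, 3L)

# 0 means strictly outside; 1, 2 and 3 are all inside or on the boundary
in_ci <- codes != 0
in_ci
```

Applied to the plot data, the same comparison would be df$in_ci != 0. Whether points on an edge or vertex (codes 2 and 3) should count as inside is a judgment call.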

hrbrmstr
Using ggplot_build as in @hrbrmstr's nice solution, you can do this by simply passing geom_smooth a sequence of x values at which the error bounds should be calculated, and making that sequence equal to the x-values of your points. Then you just check whether the y-values are within the range.

library(ggplot2)

## dummy data
df <- mtcars[,c("mpg","cyl")]

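ggplot(df, aes(mpg, cyl)) +
  ## xseq is passed directly (assumes ggplot2 >= 2.0; older versions used params=list(xseq=df$mpg))
  geom_smooth(xseq=df$mpg) -> gg

## Find the points within bounds
## ($data[[1]] is the computed data of the first layer, one row per xseq value)
bounds <- ggplot_build(gg)$data[[1]]
df$inside <- with(df, bounds$ymin < cyl & bounds$ymax > cyl)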

## Add the points
gg + geom_point(data=df, aes(color=inside)) + theme_bw()

Rorschach