(This post is the second half of the problem from: How to apply loess.smoothing to both plot and then extract points?)
I have plotted loess smoothing to a scatterplot (i.e. between two quantitative variables). I would like to extract only the data points in the scatterplot that are above that smoothing line.
For instance, if this is my scatterplot:
qplot(mpg, cyl, data=mtcars)
I can plot the smoother as:
qplot(hp,wt,data=mtcars) + stat_smooth(method="loess")
Now, I want to extract only the data points that are above the smoother. I have played with code provided in (Method to extract stat_smooth line fit):
model <- loess(wt ~ hp, data=mtcars)
xrange <- range(mtcars$hp)
xseq <- seq(from=xrange[1], to=xrange[2], length=80)
pred <- predict(model, newdata = data.frame(hp = xseq), se=TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
This results in a data frame loess.DF of 80 rows and 5 columns.
I know now that I must apply a function to go through each row of the original mtcars data frame, and interpolate for each x-value (hp) its closest predicted loess y-value (wt). My only idea to accomplish this interpolation is to use a linear interpolation similar to (http://www.ajdesigner.com/phpinterpolation/linear_interpolation_equation.php). After that, I would simply compare the y-value in mtcars with the interpolated predicted loess y-value. If the y-value in mtcars is larger than the predicted loess y-value, then I keep that original data point; else, I remove it.
I started to code this, but realize that I cannot do so in an efficient way. One problem is that my (real) data set (which isn't mtcars) is very large (~40,000 rows): First, to do the linear interpolation, I need to find the two x values in the loess fit that are closest to the original x value in my data set (if there isn't exactly one perfect match), and I do not know how to do that efficiently without searching through increasing loess x-values.
How is an efficient way to approach this, for example, testing it out first on the mtcars data set? Thank you.