4

I have tested a large sample of participants on two different tests of visual perception – now, I'd like to see to what extent performance on both tests correlates.

To visualise the correlation, I plot a scatterplot in R using ggplot() and I fit a regression line (using stat_smooth()). However, since both my x and y variable are performance measures, I need to take both of them into account when fitting my regression line – thus, I cannot use a simple linear regression (using stat_smooth(method="lm")), but rather need to fit an orthogonal regression (or Total least squares). How would I go about doing this?

I know I can specify formula in stat_smooth(), but I wouldn't know what formula to use. From what I understand, none of the preset methods (lm, glm, gam, loess, rlm) are applicable.

talat
  • 68,970
  • 21
  • 126
  • 157
rvrvrv
  • 881
  • 3
  • 9
  • 29
  • 1
    You can pass the `slope` and `intercept` from your model to `geom_abline` or you can use the approach shown [here](http://stackoverflow.com/questions/19284897/smooth-pspline-wrapper-for-stat-smooth-in-ggplot2) to create your own method – user20650 Nov 18 '14 at 15:06

4 Answers4

8

It turns out that you can extract the slope and intercept from principal components analysis on (x,y), as shown here. This is just a little simpler, runs in base R, and gives the identical result to using Deming(...) in MethComp.

# same `x and `y` as @user20650's answer
df  <- data.frame(y, x)
pca <- prcomp(~x+y, df)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])

ggplot(df, aes(x,y)) + 
  geom_point() + 
  stat_smooth(method=lm, color="green", se=FALSE) +
  geom_abline(slope=slp, intercept=int, color="blue")

Brian D
  • 2,570
  • 1
  • 24
  • 43
jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • Great method while avoiding additional packages. One further question, more of aesthetics perhaps, is how to limit the length of `geom_abline()` to the data, like `stat_smooth()`? Currently `geom_abline()` extends all the way across the plot, regardless of whether the data points extend all the way across the plot. – rvrvrv Nov 21 '14 at 10:56
  • 1
    One way is to use `geom_segment`. You know the minimum and maximum of the x-range in the data, so use the slope and intercept to calculate the y values at these limits and then use `geom_segment` to draw the line. Or you could replace the `Deming` function with the nice `prcomp` method in the function `f` below. – user20650 Nov 21 '14 at 12:18
4

Caveat: not familiar with this method

I think you should be able to just pass the slope and intercept to geom_abline to produce the fitted line. Alternatively, you could define your own method to pass to stat_smooth (as shown at the link smooth.Pspline wrapper for stat_smooth (in ggplot2)). I used the Deming function from the MethComp package as suggested at link How to calculate Total least squares in R? (Orthogonal regression).

library(MethComp)
library(ggplot2)

# Sample data and model (from ?Deming example) 
set.seed(1)
M <- runif(100,0,5)
# Measurements:
x <-         M + rnorm(100)
y <- 2 + 3 * M + rnorm(100,sd=2)

# Deming regression
mod <- Deming(x,y)

# Define functions to pass to stat_smooth - see mnel's answer at link for details
# Defined the Deming model output as class Deming to define the predict method
# I only used the intercept and slope for predictions - is this correct?
f <- function(formula,data,SDR=2,...){
        M <- model.frame(formula, data)
        d <- Deming(x =M[,2],y =M[,1], sdr=SDR)[1:2]
        class(d) <- "Deming"
        d  
        }

# an s3 method for predictdf (called within stat_smooth)
predictdf.Deming <- function(model, xseq, se, level) {
                         pred <- model %*% t(cbind(1, xseq) )
                         data.frame(x = xseq, y = c(pred))
                         }

ggplot(data.frame(x,y), aes(x, y)) + geom_point() + 
          stat_smooth(method = f, se= FALSE, colour='red', formula=y~x, SDR=1) +  
          geom_abline(intercept=mod[1], slope=mod[2], colour='blue') +
          stat_smooth(method = "lm", se= FALSE, colour='green', formula = y~x)

enter image description here

So passing the intercept and slope to geom_abline produces the same fitted line (as expected). So if this is the correct approach then imo its easier to go with this.

Community
  • 1
  • 1
user20650
  • 24,654
  • 5
  • 56
  • 91
3

The MethComp package seems to be no longer maintained (was removed from CRAN). Russel88/COEF allows to use stat_/geom_summary with method="tls" to add an orthogonal regression line.

Based on this and wikipedia:Deming_regression I created the following functions, which allow to use noise ratios other than 1:


deming.fit <- function(x, y, noise_ratio = sd(y)/sd(x)) {
  if(missing(noise_ratio) || is.null(noise_ratio)) noise_ratio <- eval(formals(sys.function(0))$noise_ratio) # this is just a complicated way to write `sd(y)/sd(x)`
  delta <-  noise_ratio^2
  x_name <- deparse(substitute(x))

  s_yy <- var(y)
  s_xx <- var(x)
  s_xy <- cov(x, y)
  beta1 <- (s_yy - delta*s_xx + sqrt((s_yy - delta*s_xx)^2 + 4*delta*s_xy^2)) / (2*s_xy)
  beta0 <- mean(y) - beta1 * mean(x) 

  res <- c(beta0 = beta0, beta1 = beta1)
  names(res) <- c("(Intercept)", x_name)
  class(res) <- "Deming"
  res
}

deming <- function(formula, data, R = 100, noise_ratio = NULL, ...){
  ret <- boot::boot(
    data = model.frame(formula, data), 
    statistic = function(data, ind) {
      data <- data[ind, ]
      args <- rlang::parse_exprs(colnames(data))
      names(args) <- c("y", "x")
      rlang::eval_tidy(rlang::expr(deming.fit(!!!args, noise_ratio = noise_ratio)), data, env = rlang::current_env())
    },
    R=R
  )
  class(ret) <- c("Deming", class(ret))
  ret  
}

predictdf.Deming <- function(model, xseq, se, level) {
  pred <- as.vector(tcrossprod(model$t0, cbind(1, xseq)))
  if(se) {
    preds <- tcrossprod(model$t, cbind(1, xseq))
    data.frame(
      x = xseq,
      y = pred,
      ymin = apply(preds, 2, function(x) quantile(x, probs = (1-level)/2)),
      ymax = apply(preds, 2, function(x) quantile(x, probs = 1-((1-level)/2)))
    )
  } else {
    return(data.frame(x = xseq, y = pred))
  }
}

# unrelated hlper function to create a nicer plot:
fix_plot_limits <- function(p) p + coord_cartesian(xlim=ggplot_build(p)$layout$panel_params[[1]]$x.range, ylim=ggplot_build(p)$layout$panel_params[[1]]$y.range)

Demonstration:

library(ggplot2)

#devtools::install_github("Russel88/COEF")
library(COEF)

fix_plot_limits(
    ggplot(data.frame(x = (1:5) + rnorm(100), y = (1:5) + rnorm(100)*2), mapping = aes(x=x, y=y)) +
      geom_point()
    ) +
  geom_smooth(method=deming, aes(color="deming"), method.args = list(noise_ratio=2)) +
  geom_smooth(method=lm, aes(color="lm")) +
  geom_smooth(method = COEF::tls, aes(color="tls"))

Created on 2019-12-04 by the reprex package (v0.3.0)

jan-glx
  • 7,611
  • 2
  • 43
  • 63
  • May you know why the confidence interval calculated using your function at `noise_ratio = 1` is slightly different from that produced by the COEF::tls method? P.S. Beautiful answer as it is the only one that includes confidence intervals. Thank you! – Cris Sep 28 '20 at 15:17
  • 1
    They are estimated with the bootstrap thus are stochastic, you could try to increase the number of bootstrap samples (`R`) and check if that helps to make them more similar! Let me know if there is a systematic difference! – jan-glx Sep 28 '20 at 16:03
  • Oh, you're right! the interval slightly changes in between runs. Thank you. – Cris Sep 28 '20 at 18:53
  • 1
    Extending geom_smooth like this is discouraged by the authors (https://github.com/tidyverse/ggplot2/issues/3132) (but I am not sure what the alternatives are). To work around the non-export of predictdf y ggplot2 one could use `registerS3method("predictdf", "YourClass", yourpackage:::predictdf.YourClass, envir = environment(ggplot2:::predictdf))` in yourpackage's `.on.load` – jan-glx Dec 01 '20 at 16:35
  • [Another geom_smooth hack.](https://stackoverflow.com/a/58187435/1870254). – jan-glx Dec 01 '20 at 16:35
1

For anyone who is interested, I validated jhoward's solution against the deming::deming() function, as I was not familiar with jhoward's method of extracting the slope and intercept using PCA. They indeed produce identical results. Reprex is:

# Sample data and model (from ?Deming example) 
set.seed(1)
M <- runif(100,0,5)
# Measurements:
x <-         M + rnorm(100)
y <- 2 + 3 * M + rnorm(100,sd=2)

# Make data.frame()
df <- data.frame(x,y)

# Get intercept and slope using deming::deming()
library(deming)
mod_Dem <- deming::deming(y~x,df)
slp_Dem <- mod_Dem$coefficients[2]
int_Dem <- mod_Dem$coefficients[1]

# Get intercept and slope using jhoward's method
pca <- prcomp(~x+y, df)
slp_jhoward <- with(pca, rotation[2,1] / rotation[1,1])
int_jhoward <- with(pca, center[2] - slp_jhoward*center[1])

# Plot both orthogonal regression lines and simple linear regression line
library(ggplot2)
ggplot(df, aes(x,y)) + 
  geom_point() + 
  stat_smooth(method=lm, color="green", se=FALSE) +
  geom_abline(slope=slp_jhoward, intercept=int_jhoward, color="blue", lwd = 3) +
  geom_abline(slope=slp_Dem, intercept=int_Dem, color = "white", lwd = 2, linetype = 3)

enter image description here

Interestingly, if you switch the order of x and y in the models (i.e., to mod_Dem <- deming::deming(x~y,df) and pca <- prcomp(~y+x, df)) , you get completely different slopes:

enter image description here

My (very superficial) understanding of orthogonal regression was that it does not treat either variable as independent or dependent, and thus that the regression line should be unaffected by how the model is specified, e.g., as y~x vs x~y. Clearly I was very much mistaken, and I would be interested to hear anyone's thoughts about exactly why I was so wrong.

PhelsumaFL
  • 61
  • 8