
I am trying to perform a linear regression on experimental data consisting of replicate measures of the same condition (for several conditions) to check the reliability of the data. For each condition I have ~5k-10k observations stored in a data frame `df`:

         cond1 repA    cond1 repB    cond2 repA   cond2 repB  ...
[1]      4.158660e+06  4454400.703   ...
[2]      1.458585e+06  4454400.703   ...
[3]      NA            887776.392    ...
...
[5024]   9571785.382   9.679092e+06  ...
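
For a reproducible stand-in, data with this structure can be simulated along these lines (the distribution parameters here are entirely made up):

set.seed(42)                                         # hypothetical stand-in for the real data
n <- 5000
signal <- 10^rnorm(n, mean = 6.5, sd = 0.5)          # underlying intensities
df <- data.frame(cond1_repA = signal * 10^rnorm(n, sd = 0.15),
                 cond1_repB = signal * 10^rnorm(n, sd = 0.15))
df$cond1_repA[sample(n, 50)] <- NA                   # some missing values, as above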

I use the following code to plot a scatterplot, the `lm` fit, and the adjusted R² value (stored in `rdata`) for each pair of replicates:

for (i in seq(1, 13, 2)) {                        # columns i and i+1 hold the two replicates
  vec <- na.exclude(cbind(df[, i], df[, i + 1]))  # drop rows with an NA in either replicate
  plot(log10(vec[, 1]), log10(vec[, 2]),
       xlab = 'rep A', ylab = 'rep B', col = "#00000033")
  fit <- lm(log10(vec[, 2]) ~ log10(vec[, 1]))
  abline(fit, col = 'red')
  rdata[1, (i + 1) / 2] <- format(summary(fit)$adj.r.squared, digits = 4)
  legend("topleft", bty = "n", legend = paste("R2 is", rdata[1, (i + 1) / 2]))
}

However, the `lm` fit seems to be shifted so that it does not follow the trend I see in the experimental data:

[scatterplot of log10(rep A) vs. log10(rep B): the red lm line misses the points in the upper right of the cloud]

This consistently occurs for every condition. I unsuccessfully tried to find an explanation by looking at the source code and browsing different forums and posts (this or here).

  • Welcome to StackOverflow! Please read the info about how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Axeman Jan 02 '17 at 15:05
  • Also, can you elaborate on why you think the result is wrong? Is the only problem that the regression line is not exactly on the data in the upper right corner? – Axeman Jan 02 '17 at 15:09
  • You could compare the `lm` fit to the fit of a local linear regression (e.g. `loess`) to get a better picture of what is going on; my suspicion is that the (local) linear fit for lower `A` has a markedly shallower slope than for higher `A`, explaining the overall fit that does not go through the points on the top-right. – Thales Jan 02 '17 at 15:20
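
For reference, a minimal sketch of the `loess` comparison Thales suggests (assuming `vec` holds one pair of replicates, as in the loop above):

lx <- log10(vec[, 1])
ly <- log10(vec[, 2])
plot(lx, ly, xlab = 'rep A', ylab = 'rep B', col = "#00000033")
abline(lm(ly ~ lx), col = 'red')       # global linear fit
lo <- loess(ly ~ lx)                   # local linear fit
ord <- order(lx)
lines(lx[ord], predict(lo)[ord], col = 'blue', lwd = 2)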

1 Answer


I would have liked to simply comment/ask a few questions, but I can't.

From what I've understood, both repA and repB are measured with error. Hence, you cannot fit your data using an ordinary least squares procedure, which only takes into account the error in Y (some might argue a weighted OLS may work; however, I'm not skilled enough to discuss that). Your question seems linked to this one.
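
As a small simulation sketch of that point (all numbers here are made up), noise in X alone is enough to flatten the OLS slope even when the true slope is 1; the `x1` and `y1` defined here also stand in for the inputs in the snippets below:

set.seed(1)
true <- rnorm(5000, mean = 6, sd = 0.5)   # true (log10) values
x1 <- true + rnorm(5000, sd = 0.2)        # replicate A = truth + measurement error
y1 <- true + rnorm(5000, sd = 0.2)        # replicate B = truth + measurement error
coef(lm(y1 ~ x1))[2]                      # OLS slope: noticeably below the true value of 1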

What you can use is a total least squares procedure: it takes the error in both X and Y into account. In the example below, I've used a "normal" TLS assuming the error is the same in X and Y (thus error.ratio=1). If it is not, you can specify the error ratio by entering error.ratio=var(y1)/var(x1) (at least I think it's var(Y)/var(X): check the documentation to make sure).

library(mcr)

# x1, y1: the two replicates (e.g. log10(vec[, 1]) and log10(vec[, 2]))
MCR_reg <- mcreg(x1, y1, method.reg = "Deming", error.ratio = 1, method.ci = "analytical")
MCR_intercept <- getCoefficients(MCR_reg)[1, 1]   # estimated intercept
MCR_slope <- getCoefficients(MCR_reg)[2, 1]       # estimated slope

# CI for predicted values
x_to_predict <- seq(0, 35)
predicted_values <- MCResultAnalytical.calcResponse(MCR_reg, x_to_predict, alpha = 0.05)
CI_low <- predicted_values[, 4]   # lower bound of the 95% CI
CI_up <- predicted_values[, 5]    # upper bound of the 95% CI
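
To visualize the result, a sketch reusing the objects above:

plot(x1, y1, col = "#00000033")                      # scatter of the two replicates
abline(MCR_intercept, MCR_slope, col = 'red')        # Deming fit
lines(x_to_predict, CI_low, col = 'red', lty = 2)    # lower 95% CI bound
lines(x_to_predict, CI_up, col = 'red', lty = 2)     # upper 95% CI bound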

Please note that, in Deming/TLS regressions, your x- and y-errors are supposed to follow a normal distribution, as explained here. If that's not the case, go for a Passing-Bablok regression (and the R code is here).
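
If I'm not mistaken, mcr exposes Passing-Bablok through the same interface, so a sketch would be:

PB_reg <- mcreg(x1, y1, method.reg = "PaBa")   # Passing-Bablok, non-parametric
getCoefficients(PB_reg)                        # intercept/slope with CIs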

Also note that R² isn't defined for either Deming or Passing-Bablok regressions (see here). A correlation coefficient is a good proxy, although it does not provide exactly the same information. Since you're studying a linear correlation between two factors, see Pearson's product-moment correlation coefficient, and use e.g. the `rcorr` function.
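
For instance (assuming `rcorr` refers to the one in the Hmisc package):

cor(x1, y1)                       # Pearson correlation, base R
library(Hmisc)
rcorr(x1, y1, type = "pearson")   # correlation plus p-value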

  • See also regression attenuation/dilution, e.g. https://en.wikipedia.org/wiki/Regression_dilution – Thales Jan 03 '17 at 09:35