
I ran a linear regression on my data using the lm function. Everything works (no error message), but I'm somewhat surprised by the result: I have the impression R "misses" a group of points, i.e. that the intercept and slope are not the best fit. Specifically, I am referring to the group of points at coordinates x = 15-25, y = 0-20.

My questions:

  • is there a function to compare the fit of the "expected" coefficients with that of the lm-calculated coefficients?
  • or have I made a silly mistake in my code that leads lm to this result?

Following some answers: additional information on x and y

x and y are both visual estimates of disease symptoms, with the same uncertainty on each. (Plot: the data with the lm regression line and the abline of the expected result.)

The data and code are here:

x1=c(24.0,23.9,23.6,21.6,21.0,20.8,22.4,22.6,
     21.6,21.2,19.0,19.4,21.1,21.5,21.5,20.1,20.1,
     20.1,17.2,18.6,21.5,18.2,23.2,20.4,19.2,22.4,
     18.8,17.9,19.1,17.9,19.6,18.1,17.6,17.4,17.5,
     17.5,25.2,24.4,25.6,24.3,24.6,24.3,29.4,29.4,
     29.1,28.5,27.2,27.9,31.5,31.5,31.5,27.8,31.2,
     27.4,28.8,27.9,27.6,26.9,28.0,28.0,33.0,32.0,
     34.2,34.0,32.6,30.8)

y1=c(100.0,95.5,93.5,100.0,98.5,99.5,34.8,
     45.8,47.5,17.4,42.6,63.0,6.9,12.1,30.5,
     10.5,14.3,41.1, 2.2,20.0,9.8,3.5,0.5,3.5,5.7,
     3.1,19.2,6.4, 1.2, 4.5, 5.7, 3.1,19.2, 6.4,
     1.2,4.5,81.5,70.5,91.5,75.0,59.5,73.3,66.5,
     47.0,60.5,47.5,33.0,62.5,87.0,86.0,77.0,
     86.0,83.0,78.5,83.0,83.5,73.0,69.5,82.5,78.5,
     84.0,93.5,83.5,96.5,96.0,97.5)   



## x11()
plot(x1,y1,xlim=c(0,35),ylim=c(0,100))

# linear regression
reg_lin=lm(y1 ~ x1)
abline(reg_lin,lty="solid", col="royalblue")
text(12.5,25,labels="R result",col="royalblue", cex=0.85)
text(12.5,20,labels=bquote(y== .(5.26)*x - .(76)),col="royalblue", cex=0.85)

# result I would have imagined
abline(a=-150,b=8,lty="dashed", col="red")
text(27.5,25,labels="What I think is better",col="red", cex=0.85)
text(27.5,20,labels=bquote(y== .(8)*x - .(150)),col="red", cex=0.85)
NOTM
  • Just calculate the sum of squared residuals with your putative best fit and that produced by `lm`. – MichaelChirico Aug 06 '15 at 18:04
  • How can you say that the intercept and slope are not the best fit? And if they are not, under which model would they be: linear, loess, generalized, etc.? – SabDeM Aug 06 '15 at 18:04
  • Thanks for your answers. @SabDeM: I am referring to a linear model. I am not sure the intercept and slope are not the best fit, but I am surprised the regression line is not passing between the points (esp. it _seems_ to be "missing" the group of points at the bottom of the graphic). But it's only an impression, of course. @MichaelChirico: is there any function to do that? – NOTM Aug 06 '15 at 18:05
  • your red line looks like it might be what you would get from total least squares (minimizing the distance in both directions) – Rorschach Aug 06 '15 at 18:20
  • @nongkrong I think what you have in mind is total _absolute deviation_, which is different. see my answer. – MichaelChirico Aug 06 '15 at 18:22
  • @MichaelChirico no, i was referring to total least squares, aka orthogonal regression. similar though – Rorschach Aug 06 '15 at 18:24
  • @nongkrong interesting, had never heard of it. – MichaelChirico Aug 06 '15 at 18:27
  • @nongkrong for reference, I would have phrased that as "minimizing the orthogonal distance from the prediction", as beautifully illustrated [here](http://stats.stackexchange.com/questions/13152/how-to-perform-orthogonal-regression-total-least-squares-via-pca) – MichaelChirico Aug 06 '15 at 18:29
  • Regarding "what you think is better": realize, our brains are powerful visualizers. What "seems" better is likely the line where all the points are closer **in x & y**. Least squares looks only at errors **in y**. If you'd wanted to consider error in both dimensions, perhaps look at [Total Least Squares](https://en.wikipedia.org/wiki/Total_least_squares) or something similar. – Mike Williamson Aug 06 '15 at 18:33
  • great question, good answers (maybe marginally better for CrossValidated ...) – Ben Bolker Aug 06 '15 at 18:45

2 Answers


Try this:

# extract the fitted intercept and slope
reg_lin_int <- reg_lin$coefficients[1]
reg_lin_slp <- reg_lin$coefficients[2]

# sum of squared residuals under the lm fit
sum((y1 - (reg_lin_int + reg_lin_slp*x1)) ^ 2)
# [1] 39486.33
# sum of squared residuals under the "expected" line y = -150 + 8*x
sum((y1 - (-150 + 8 * x1)) ^ 2)
# [1] 55583.18

The sum of squared residuals is lower under the lm line. This is to be expected: reg_lin_int and reg_lin_slp are, by construction, the coefficients that minimize the total squared error.

Intuitively, we know that estimators under a squared loss function are sensitive to outliers. The line "misses" the group at the bottom because it moves closer to the far-away group at the top left: squared distance gives those distant points much more weight.
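To see this on your data, compare how much squared error that top-left group alone contributes under each line (the top selector below is just an illustrative way of picking out those six points):

top <- x1 < 25 & y1 > 90                             # picks the six points near x = 21-24, y = 93-100
res_lm    <- y1 - (reg_lin_int + reg_lin_slp * x1)   # residuals under the lm fit
res_guess <- y1 - (-150 + 8 * x1)                    # residuals under the "expected" line
c(lm = sum(res_lm[top]^2), guess = sum(res_guess[top]^2))

Those six points sit far below the guessed line, so their squared residuals are huge; shifting the line toward them buys a large reduction in squared error at the cost of smaller increases elsewhere, which is exactly what lm does.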

In fact, if we use Least Absolute Deviations regression (i.e., specify an absolute loss function instead of a square), the result is much closer to your guess:

library(quantreg)
lad_reg <- rq(y1 ~ x1)  # default tau = 0.5: median regression, i.e. least absolute deviations

(Plot: the scatterplot with the LAD regression line added.)
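Only the resulting plot is shown above; a minimal sketch for overlaying this fit on the existing scatterplot (colour and line width here are just illustrative choices) is:

abline(coef(lad_reg)[1], coef(lad_reg)[2], col = "darkorange", lwd = 2)  # add the LAD fit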

(Pro tip: use lwd to make your graphs much more readable)

What gets even closer to what you had in mind is Total Least Squares, as mentioned by @nongkrong and @MikeWilliamson. Here is the result of TLS on your sample:

# the first principal component of (x1, y1) gives the direction of the TLS line
v <- prcomp(cbind(x1, y1))$rotation
bbeta <- v[2, 1] / v[1, 1]            # slope: y-component over x-component of PC1
inter <- mean(y1) - bbeta * mean(x1)  # the TLS line passes through the centroid

(Plot: the scatterplot with the TLS line added.)
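Again, only the plot appears above; the TLS line can be overlaid with a one-liner (styling is illustrative):

abline(inter, bbeta, col = "purple", lwd = 2)  # add the TLS (orthogonal regression) line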

MichaelChirico
  • Ok then, the code is that simple. Thanks a lot. Frankly I'm quite puzzled: how can the sum of squares be minimized when the regression line is missing a group of points (again, the group at the bottom)? The penalty (in terms of sum of squares) should be hard to compensate for. – NOTM Aug 06 '15 at 18:16
  • @NOTM see update. It's about the loss function your intuition chose. – MichaelChirico Aug 06 '15 at 18:19
  • Ok, thanks again! I was reading pages on Total Least Squares and typing code to show the results of OLS, LAD and TLS on the same graphic. Looks like you did it quicker. Thanks again. – NOTM Aug 06 '15 at 18:50
  • @NOTM Keep in mind: 2^2 - 1^2 = 3 but 50^2 - 49^2 = 99. So when using least squares it's often beneficial to take a hit on making the line fit closer to the "outliers" to reduce the huge difference between the prediction and the observed value by a little since it will reduce the squared error by much more than trying to fit the line even closer to points it already fits well. So look at what your version of the best fit does (in terms of vertical error) compared to the least squares best fit for those "outliers" around x=20, y=100. – Dason Dec 31 '15 at 22:34

You got a nice answer already, but maybe this is also helpful:

As you know, OLS minimizes the sum of squared errors in the y-direction. This assumes that the uncertainty of your x-values is negligible, which is often the case. But possibly it's not the case for your data. If we assume that the uncertainties in x and y are equal and do a Deming regression, we get a fit more similar to what you expected.

library(MethComp)
dem_reg <- Deming(x1, y1)            # equal uncertainty in x and y is the default assumption
abline(dem_reg[1:2], col = "green")  # first two elements are the intercept and slope

(Plot: the scatterplot with the Deming regression line in green.)

You don't provide detailed information about your data, so this might or might not be useful.

Roland
  • Roland, @MikeWilliamson: you are right. I did OLS because that's the regression technique I know, not what I should use. In the case of my data, x and y are both visual estimates of disease symptoms: in other words, uncertainty is the same in x and y. So yes, TLS is much more appropriate. Thanks to you both. – NOTM Aug 06 '15 at 18:56
  • It's not relevant to you then, but I should still mention it: Deming regression allows you to specify the ratio of the uncertainties in x and y, in case they are not equal (see the sketch after these comments). – Roland Aug 06 '15 at 19:03
  • Ok thanks, I'll look for this too! Learned a lot today. – NOTM Aug 06 '15 at 19:12
  • that's a nice explanation of deming. is there an easy way to do deming with an absolute loss function? – MichaelChirico Aug 06 '15 at 23:36
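A minimal sketch of that option, assuming MethComp's Deming() exposes the standard-deviation ratio through an sdr argument (or the variance ratio through vr); check ?Deming for the exact parameterization in your version:

library(MethComp)
# hypothetical choice: treat the y measurements as twice as noisy as the x measurements
dem_reg2 <- Deming(x1, y1, sdr = 2)                      # sdr assumed to be sd(y errors) / sd(x errors)
abline(dem_reg2[1:2], col = "darkgreen", lty = "dashed") # compare against the equal-uncertainty fit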