I have plotted a 2-D histogram in such a way that I can add lines, points, etc. to the plot. Now I want to fit a linear regression through the region of dense points, but my regression line seems to be nowhere near where it should be. To demonstrate, here is my plot on the left with both a lowess fit and a linear fit.

ok <- complete.cases(a, b)  # keep only pairs where both a and b are present
lines(lowess(a[ok], b[ok], iter=10), col='gray', lwd=3)

abline(lm(b[cc]~a[cc]),lwd=3)

Here a and b are my values, and cc indexes the points within the densest regions (i.e. where most points lie): red+yellow+blue.

enter image description here

Why doesn't my regression line look more like the one on the right (hand-drawn fit)? If I were drawing a line of best fit by eye, that is where I would put it.

I have numerous plots similar to this, but I still get the same results.

enter image description here

Are there any alternative linear regression fits that could prove to be better for me?

  • The thick line on the right-hand plot looks like the one you'd get as the first principal component of the data points. – Gavin Simpson Mar 08 '14 at 17:44
  • Great illustration of your thoughts. Will likely use this in teaching... What's interesting to me: usually I'd say this question should be on Cross Validated, but at the same time I doubt that you would have gotten such a good answer this quickly. – Matt Bannert Mar 09 '14 at 08:20
  • would you mind sharing a reproducible example? – Matt Bannert Mar 09 '14 at 23:41

1 Answer

A linear regression is a method for fitting a linear function to a set of points (observations) by minimizing the least-squares error, i.e. the sum of squared vertical distances between the points and the line.

Now imagine your heatmap showing a shape for which you would expect a vertical line to fit best. Just turn your heatmap 10 degrees counterclockwise and you have it.

Now how would a linear function that is vertical be defined? Exactly: it is not possible, because a function of x cannot take more than one value at a single x.

The result of this little thought experiment is that you are confusing the purpose of linear regression with what you most likely want, which is, as Gavin Simpson already indicated, the first principal component vector.
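
To make the difference concrete, here is a minimal sketch in R that compares the two lines on simulated data. The data, the seed, and the names s, ols, pca, pc_slope and pc_intercept are assumptions made for illustration; they are not the asker's a, b or cc. The cloud is built to be steep, so the least-squares line (vertical distances) should come out visibly flatter than the first principal component (perpendicular distances).

```r
# Hypothetical stand-in data: a steep, elongated cloud of points.
set.seed(42)
n <- 2000
s <- rnorm(n)                          # position along the cloud's long axis
a <- 0.3 * s + rnorm(n, sd = 0.3)      # x values
b <- 1.0 * s + rnorm(n, sd = 0.3)      # y values

# Ordinary least squares: minimizes squared vertical distances only.
ols <- lm(b ~ a)

# First principal component: the direction of largest variance, which
# minimizes squared perpendicular distances to the line.
pca <- prcomp(cbind(a, b))             # prcomp centers the data by default
v <- pca$rotation[, 1]                 # loadings of the first component
pc_slope <- unname(v[2] / v[1])        # rise over run of that direction
pc_intercept <- mean(b) - pc_slope * mean(a)  # line passes through the centroid

plot(a, b, pch = ".", col = "gray40")
abline(ols, lwd = 3)                                        # regression line
abline(a = pc_intercept, b = pc_slope, lwd = 3, lty = 2)    # first PC (dashed)
```

With the asker's data, the same idea would presumably be applied to the dense subset, i.e. prcomp(cbind(a[cc], b[cc])), after dropping any missing values.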

Raffael
  • Great answer. For entertainment, @PhillipPhillipson, you might try weighted least squares with the weights determined by the local density function you used to generate the heat map. That's another way of "adjusting" your data based on exterior knowledge -- I'm not in the least claiming that it's better than PCA. – Carl Witthoft Mar 08 '14 at 21:37
  • I never even knew such regression models existed! The PCA did a much better job. – PhillipPhillipson Mar 16 '14 at 21:19
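
Carl Witthoft's weighted least squares suggestion can be sketched roughly as follows. This is only one possible reading of the comment: the simulated data, the use of MASS::kde2d as the density estimate, the 50 x 50 grid, and the names dens, ia, ib, w and wls are all assumptions, since the question does not say how the heat map density was computed. Note that the weighted fit still minimizes vertical distances, so it would not be expected to reproduce the principal component line.

```r
library(MASS)  # provides kde2d, a 2-D kernel density estimate

# Hypothetical stand-in data for the question's a and b.
set.seed(1)
n <- 1000
a <- rnorm(n)
b <- 2 * a + rnorm(n)

# Estimate the local density on a grid spanning the data range.
dens <- kde2d(a, b, n = 50)

# Look up the grid cell of each point and use its density as the weight.
ia <- findInterval(a, dens$x, all.inside = TRUE)
ib <- findInterval(b, dens$y, all.inside = TRUE)
w <- dens$z[cbind(ia, ib)]

# Weighted least squares: points in dense regions count for more.
wls <- lm(b ~ a, weights = w)

plot(a, b, pch = ".", col = "gray40")
abline(lm(b ~ a), lwd = 3)        # unweighted fit
abline(wls, lwd = 3, lty = 2)     # density-weighted fit (dashed)
```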