Numpy and R give non-zero intercept in linear regression when x = y

Question

I was testing some code which, among other things, runs a linear regression of the form y = m * x + b on some data. To keep things simple, I set my x and y data equal to each other, expecting the model to return one for the slope and zero for the intercept. However, that's not what I saw. Here's a super boiled-down example, taken mostly from the numpy docs:

>>> import numpy as np
>>> y = np.arange(5)
>>> x = np.arange(5)
>>> A = np.vstack([x, np.ones(5)]).T
>>> np.linalg.lstsq(A, y)
(array([  1.00000000e+00,  -8.51331872e-16]), array([  7.50403936e-31]), 2, array([ 5.78859314,  1.22155205]))
>>> #     ^slope           ^intercept                  ^residuals        ^rank    ^singular values

Numpy finds the exact slope of the true line of best fit (one), but reports an intercept that, while very small, is not zero. Additionally, even though the data can be modeled perfectly by the linear equation y = 1 * x + 0, numpy does not find this exact equation, so it reports a tiny but non-zero residual value.

As a sanity check, I tried this out in R (my "native" language), and observed similar results:

> x <- c(0 : 4)
> y <- c(0 : 4)
> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
 -3.972e-16    1.000e+00

My question is, why and under what circumstances does this happen? Is it an artifact of looking for a model with a perfect fit, or is there always a tiny bit of noise added to regression output that we usually just don't see? In this case, the answer is almost certainly close enough to zero, so I'm mainly driven by academic curiosity. However, I also wonder if there are cases where this effect could be magnified to be nontrivial relative to the data.

I've probably revealed this by now, but I have basically no understanding of lower-level programming languages, and while I once had a cursory understanding of how to do this sort of linear algebra "by hand", it has long ago faded from my mind.

I think this is basically http://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal. Have a look at `2 - (sqrt(2)^2)` — user20650, Mar 26 '15 at 15:11

WakkaDojo · Accepted Answer · 2015-04-09T22:50:16.913

It looks like numerical error: the y-intercept is extremely small.

Python, numpy included, uses double-precision floating-point numbers by default. These numbers are stored with a 52-bit coefficient (the significand, analogous to the coefficient in front of the base in scientific notation).

In your case, you found a y-intercept of ~4e-16 in R (and ~9e-16 in numpy). As it turns out, a 52-bit coefficient gives a relative precision of roughly 2e-16. Basically, somewhere in the regression, a number on the order of 1 was subtracted from something very close to itself, and the result hit the precision limit of double-precision floating point.
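As a rough check of the reasoning above, here is a minimal sketch that compares the fitted intercept with machine epsilon for doubles. The variable names and the 100x tolerance factor are my own illustrative choices, not anything numpy prescribes, and the exact intercept you get may differ between numpy versions and platforms.

```python
import numpy as np

# Machine epsilon for float64: the gap between 1.0 and the next
# representable double, i.e. 2**-52 (about 2.2e-16).
eps = np.finfo(np.float64).eps

x = np.arange(5, dtype=float)
y = np.arange(5, dtype=float)
A = np.vstack([x, np.ones(5)]).T

# rcond=None asks for numpy's current default cutoff and avoids a
# FutureWarning on some numpy versions.
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

# Express the intercept in units of eps. A handful of eps is what
# accumulated rounding error looks like for data of order 1.
intercept_in_eps = abs(intercept) / eps

# Hypothetical tolerance: treat anything below 100 * eps * (data scale)
# as round-off rather than a real non-zero intercept.
scale = np.abs(y).max()
is_roundoff = abs(intercept) < 100 * eps * scale

print(eps, slope, intercept, intercept_in_eps, is_roundoff)
```

If the aim is just to treat such a value as zero in a test, comparing with a tolerance (for example `np.isclose(intercept, 0.0, atol=1e-12)`) is usually more practical than expecting exact equality.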

edited Apr 09 '15 at 22:50

answered Mar 26 '15 at 15:05

Based on @user20650's comment and subsequent research, this seems like it's the right answer... but it's really too bare-bones to be of much use. Since it's been two weeks, I'd like to accept an answer, but I would really appreciate it if you'd update this to be a little more explanatory and useful. – Joe Apr 09 '15 at 17:58
