I was testing some code which, among other things, runs a linear regression of the form y = m * x + b on some data. To keep things simple, I set my x and y data equal to each other, expecting the model to return one for the slope and zero for the intercept. However, that's not what I saw. Here's a super boiled-down example, taken mostly from the numpy docs:
>>> import numpy as np
>>> y = np.arange(5)
>>> x = np.arange(5)
>>> A = np.vstack([x, np.ones(5)]).T
>>> np.linalg.lstsq(A, y)
(array([ 1.00000000e+00, -8.51331872e-16]), array([ 7.50403936e-31]), 2, array([ 5.78859314, 1.22155205]))
>>> # returned tuple, in order: [slope, intercept], residuals, rank, singular values
Numpy finds the exact slope of the true line of best fit (one), but reports an intercept that, while very small, is not zero. Additionally, even though the data can be perfectly modeled by the linear equation y = 1 * x + 0, numpy does not find this exact equation, so it reports a tiny but non-zero residual value.
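To put "very small" in context, here is a minimal sketch that compares the fitted values against double-precision machine epsilon. It assumes the default 64-bit floats; passing `rcond=None` and using `np.finfo` and `np.isclose` are my additions for illustration, not part of the numpy docs example above. The exact digits printed will likely differ between machines and library versions.

```python
import numpy as np

x = np.arange(5)
y = np.arange(5)
A = np.vstack([x, np.ones(5)]).T

# rcond=None asks for the machine-precision-based cutoff for small
# singular values (and avoids a FutureWarning in some NumPy versions).
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef

# Machine epsilon: the gap between 1.0 and the next representable double.
eps = np.finfo(float).eps

print("slope:    ", slope)
print("intercept:", intercept)
print("epsilon:  ", eps)

# Compare with a tolerance instead of testing for exact equality.
print(np.isclose(slope, 1.0), np.isclose(intercept, 0.0))
```

If the intercept is within a few multiples of epsilon, as it is in the output I pasted, that at least suggests the discrepancy is at the level of floating-point rounding rather than something in the data.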
As a sanity check, I tried this out in R (my "native" language), and observed similar results:
> x <- c(0 : 4)
> y <- c(0 : 4)
> lm(y ~ x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
 -3.972e-16    1.000e+00
My question is, why and under what circumstances does this happen? Is it an artifact of looking for a model with a perfect fit, or is there always a tiny bit of noise added to regression output that we usually just don't see? In this case, the answer is almost certainly close enough to zero, so I'm mainly driven by academic curiosity. However, I also wonder if there are cases where this effect could be magnified to be nontrivial relative to the data.
I've probably revealed this by now, but I have basically no understanding of lower-level programming languages, and while I once had a cursory understanding of how to do this sort of linear algebra "by hand", it has long since faded from my mind.