Following this post, I now have serious doubts about whether R-squared or the F-test are good indicators of a good linear fit to data with random noise. Hence, I want to develop a custom regression function so I can both learn how it works and maybe improve upon the existing tools.
Consider these randomly generated ndarrays x and y:
import numpy as np
np.random.seed(42)
x = np.random.rand(30) * 10
y = 1.5 * x + 0.3 + (np.random.rand(30) - 0.5) * 3.5
Now I can define the average/mean absolute deviation of any set of data points as:
def aad(X, Y, a, b):  # assumes X and Y are of identical shape/size
    n = X.size  # highly unsafe!
    # foot of the perpendicular from each (X, Y) onto the line y = a * x + b
    U = (X + a * (Y - b)) / (a * a + 1)
    V = a * U + b
    E = np.sqrt(np.power(X - U, 2) + np.power(Y - V, 2))
    return E.sum() / n
which, in my opinion, is the best way to quantify the fitness of a line y = a * x + b to the data points. The function simply finds the closest point on the assumed line to each data point and then averages the perpendicular distances between the points and the line.
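As a sanity check, the same quantity has a closed form: the perpendicular distance from a point to the line is |a * x - y + b| / sqrt(a² + 1), so the average can be computed without constructing the foot of the perpendicular at all. A minimal sketch (the helper name aad_closed is mine, not a library function):

```python
import numpy as np

def aad_closed(X, Y, a, b):
    # mean perpendicular distance via the closed-form point-to-line formula
    return np.abs(a * X - Y + b).sum() / (np.sqrt(a * a + 1) * X.size)

np.random.seed(42)
x = np.random.rand(30) * 10
y = 1.5 * x + 0.3 + (np.random.rand(30) - 0.5) * 3.5
d = aad_closed(x, y, 1.5, 0.3)  # aad of the "true" line used to generate y
```

Both routes should agree to machine precision, which is a useful test when debugging the foot-of-perpendicular version.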
Now I need a function, let's say linearFit(X, Y), which, given the identically shaped ndarrays X and Y, finds the a and b that minimize aad(X, Y, a, b). It is important that the result be a global minimum, not just a local one.
Of course, in the spirit of SO's best practices, I have already tried the scipy.optimize functions fmin and brute, as you may see in the above-mentioned post and also here. However, it seems that I can't get my head around the right syntax for those functions. I would appreciate it if you could help me find a canonical and performant implementation for the presumed linearFit function. Thanks for your support in advance.
P.S. A temporary workaround offered here:
from scipy.optimize import minimize
aad_ = lambda P: aad(x, y, P[0], P[1])  # P holds the candidate (a, b)
minimize(aad_, x0=[a0, b0])  # a0, b0: initial guesses for a and b
however, the results I'm getting are not that promising! The solver does not succeed, and I get the message:
Desired error not necessarily achieved due to precision loss
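That message typically appears when minimize's default BFGS method estimates gradients of an objective that is only piecewise smooth: the absolute distances in aad have kinks where a point crosses the line, and the numerical gradient degrades there. A derivative-free method usually converges cleanly; a self-contained sketch (method choice and starting point are my assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def aad(X, Y, a, b):
    # mean perpendicular distance from the points to y = a * x + b
    return (np.abs(a * X - Y + b) / np.sqrt(a * a + 1)).mean()

np.random.seed(42)
x = np.random.rand(30) * 10
y = 1.5 * x + 0.3 + (np.random.rand(30) - 0.5) * 3.5

# Nelder-Mead needs no gradients, so the kinks in the objective do not hurt it
res = minimize(lambda p: aad(x, y, p[0], p[1]),
               x0=[1.0, 0.0], method='Nelder-Mead')
```

A single Nelder-Mead run only guarantees a local minimum, so for the global requirement it is still worth seeding it from a coarse grid scan rather than a single guess.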