
Following this post, I now have serious doubts about whether R-squared or the F-test are good indicators of a good linear fit to data with random noise. Hence, I want to develop a custom regression function so I can both learn how it works and maybe improve upon the existing tools.

Consider these randomly generated ndarrays x and y:

import numpy as np

np.random.seed(42)

x = np.random.rand(30) * 10
y = 1.5 * x + 0.3 + (np.random.rand(30) - 0.5) * 3.5
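
To get a feel for the data, a quick scatter plot (this assumes matplotlib is available; it is not needed for anything that follows) shows the noisy linear trend:

import matplotlib.pyplot as plt

plt.scatter(x, y)           # the noisy data points
plt.plot(x, 1.5 * x + 0.3)  # the underlying line used to generate them
plt.show()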

Now I can define the average/mean absolute deviation of any set of data points from a line y = a * x + b with:

def aad(X, Y, a, b): # assumes X and Y are of the identical shape/size
    n = X.size # highly unsafe!
    U = (a * X + Y - b) / 2 / a # x-coordinates of the points on the line y = a * x + b paired with the data points
    V = (a * X + Y + b) / 2 # corresponding y-coordinates on the line
    E = np.sqrt(np.power((X - U), 2) + np.power((Y - V), 2)) # distance of each data point from its paired point on the line
    return E.sum() / n # average deviation

which, in my opinion, is the best way to quantify how well a line y = a * x + b fits a set of data points. The function simply finds the closest point on the assumed line to each data point and then calculates the perpendicular distance between that data point and the line.
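
As a quick sanity check of the intended usage (the exact number it returns is not important here), the deviation of the data from the line that generated it can be computed with:

print(aad(x, y, 1.5, 0.3))  # average deviation from the generating line y = 1.5 * x + 0.3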

Now I need a function, let's say:

linearFit(X, Y)

which, given identically shaped ndarrays X and Y, finds the a and b that make aad(X, Y, a, b) minimal. It is important that the result be a global (absolute) minimum, not just a local one.

Of course, in the spirit of SO's best practices, I have already tried the scipy.optimize functions fmin and brute, as you may see in the above-mentioned post and also here. However, it seems I can't get my head around the right syntax for those functions. I would appreciate it if you could help me find a canonical and performant implementation for the presumed linearFit function. Thanks for your support in advance.
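
For reference, this is roughly the shape of what I have been attempting with those functions; the initial guess and the search ranges below are arbitrary placeholders, not values I am attached to:

from scipy import optimize

objective = lambda P: aad(x, y, P[0], P[1])  # pack (a, b) into a single parameter vector

# simplex search from an arbitrary starting point
a_fmin, b_fmin = optimize.fmin(objective, x0=[1.0, 0.0])

# brute-force grid search over a guessed window (polished by fmin by default)
a_brute, b_brute = optimize.brute(objective, ranges=((0.0, 3.0), (-2.0, 2.0)), Ns=50)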

P.S. A temporary workaround was offered here:

from scipy.optimize import minimize

aad_ = lambda P: aad(P[0], P[1], x1, y1)
minimize(aad_, x0=[X0, Y0])

however, the results I'm getting are not that promising! The solver does not succeed and I get the message:

Desired error not necessarily achieved due to precision loss

  • There are a lot of methods to calculate a least absolute deviation regression; just google some algorithms for it. It's an iterative problem, though, and has some shortcomings compared to OLS. – bryan60 Feb 25 '20 at 02:02
  • @bryan60 OK, that's what it is called. I didn't know that. I also didn't know the common method is called [Ordinary least squares (OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression. Thanks. Is there any concise way to do it with the `scipy.optimize` functions? – Foad S. Farimani Feb 25 '20 at 02:05
  • I've really never had a use case that called for an L1 regression. Removing outliers will give you roughly the same result. You're not gonna get a good fit to random data; that's kind of the point of random data. Any correlation would be spurious. – bryan60 Feb 25 '20 at 02:27
  • @bryan60 Edited the post. It is not really random data, but data with random noise. – Foad S. Farimani Feb 25 '20 at 02:29
  • If the noise is random, then absolute vs. squared errors isn't going to make much difference. Absolute is strongest when outliers should have less weight. You'll have a better fit if you can measure the cause of the noise and use it in a multivariate regression. Or, if the noise truly is random, as in signal noise, there are other techniques. – bryan60 Feb 25 '20 at 02:38
  • @bryan60 Thanks a lot. One of the goals of asking this question is that, by implementing this custom function, I can learn both the math and `scipy.optimize`. Thanks for your support. – Foad S. Farimani Feb 25 '20 at 02:44
  • @bryan60 What do you think about [my answer below](https://stackoverflow.com/a/60441689/4999991)? Your feedback would be highly appreciated. – Foad S. Farimani Feb 27 '20 at 21:06

1 Answer

First of all, thanks to this post I realized that this is not an ordinary least squares (OLS) regression, as was discussed in the comments above. It actually goes by many names, among them Deming regression, orthogonal distance regression (ODR), and total least squares (TLS). There is, of course, also a Python package for it, scipy.odr! Its syntax is a bit odd and the documentation is not much help, but a good tutorial can be found here.

Next, I found a small bug in the aad definition, and renamed and fixed it to:

def aaod(a, b, X, Y): # assumes X and Y are of the identical shape/size
    n = X.size # still highly unsafe! don't use it in real production
    U = (a * X + Y - b) / 2 / a
    V = (a * X + Y + b) / 2
    E = np.sqrt(np.power((X - U), 2) + np.power((Y - V), 2))
    return E.sum() / n

standing for average absolute orthogonal distance. Now we can define our fitting function as:

from scipy.optimize import minimize
from scipy.stats import linregress

def odrFit(X, Y):
    X0 = linregress(X, Y) # wait, this is cheating! (only used to get an initial guess)
    aaod_ = lambda P: aaod(P[0], P[1], X, Y) # pack (a, b) into a single parameter vector
    res = minimize(aaod_, x0=X0[:2], method='Nelder-Mead') # derivative-free simplex search
    res_list = res.x.tolist()
    res_list.append(aaod_(res_list)) # append the achieved average orthogonal deviation
    return res_list # [slope, intercept, aaod]

which is not necessarily the most performant or the most canonical implementation. The workaround with the temporary lambda function I learned from here, and the method='Nelder-Mead' from here. The same fit can also be done with scipy.odr as:

from scipy.odr import Model, ODR, RealData

def f(B, x):
    # linear model in the form scipy.odr expects: B is the parameter vector
    return B[0]*x + B[1]

linear = Model(f)
mydata = RealData(x, y)
myodr = ODR(mydata, linear, beta0=[1., 2.]) # beta0 is the initial guess for the parameters
myoutput = myodr.run()
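
The fitted slope and intercept end up in myoutput.beta, and for a quick human-readable summary of the run one can also call:

myoutput.pprint()  # prints the estimated parameters, their standard errors and the stopping reason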

Now, comparing the results of our custom-made odrFit() function, scipy.stats.linregress(), and the scipy.odr fit:

slope, intercept, r_value, p_value, std_err = linregress(x, y)

print(*odrFit(x, y))  # our custom fit: slope, intercept, achieved aaod
# --> 1.4804181575739097 0.47304584702448255 0.6008218016339527

print(slope, intercept, aaod(slope, intercept, x, y))  # ordinary least squares via linregress
# --> 1.434483032725671 0.5747705643012724 0.608021569291401

print(*myoutput.beta, aaod(*myoutput.beta, x, y))  # scipy.odr fit
# --> 1.5118079563432785 0.23562547897245803 0.6055838996104704

which shows that, by this metric, our function actually achieves a lower average orthogonal deviation than both scipy.stats.linregress() and the scipy.odr fit. This could just be pure luck, though, and more tests need to be done before drawing a reliable conclusion. The complete code can be found here.
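
As for those further tests, one possible shape for them (this sketch is not part of the linked code; the seeds and the number of repetitions are arbitrary) would be to regenerate the noisy data several times and compare the average orthogonal deviation each method reaches:

# hypothetical repeat test: new noise realizations of the same underlying line
for seed in range(5):
    rng = np.random.RandomState(seed)
    xs = rng.rand(30) * 10
    ys = 1.5 * xs + 0.3 + (rng.rand(30) - 0.5) * 3.5
    slope_, intercept_, *_ = linregress(xs, ys)
    print(seed, odrFit(xs, ys)[-1], aaod(slope_, intercept_, xs, ys))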
