
I am working on a problem that needs a best-fit line, but the fit should be applied to the initial part of the data rather than to all of the datapoints.

I know how to curve-fit a function to a dataset when the fit applies to the whole dataset, but I don't know how to implement it for this problem.

What I want to do is:

  • find the point after which the data deviate from a straight line; this point can change from one dataset to another (my main problem)
  • fit the best straight line to the datapoints before that point (the blue line in the sketch below).

Please let me know your thoughts.

[sketch: datapoints with a blue best-fit line through their initial, straight portion]

Ali Wali

3 Answers


I think an iterative search algorithm could do the job in this case. You just keep adding the next data point to the line as long as the vector pointing from the previously added data point to the next one doesn't deviate too much from the directions of the previously added ones.

For that, you would have to trust that the first 3 or so data points lie along a line (to get a baseline directional estimate) and then check whether the direction from the previously added data point towards the next one deviates by more than, say, 10% or 20% from the mean direction of all the points that are part of the preliminary line so far.

This involves some hyperparameter fitting, like the percentage of tolerated deviation, but I am personally not aware of any out-of-the-box solutions to this problem.

Finally, using only the points that were added, you could apply the line-fitting algorithm of your choice (probably linear regression).
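A minimal sketch of this idea, with the tolerance expressed as an angle in degrees rather than a percentage (my assumption) and `find_deviation_point` as a hypothetical helper name:

import numpy as np

def find_deviation_point(x, y, n_seed=3, tol_deg=10.0):
    # Angle of each step vector between consecutive data points
    angles = np.degrees(np.arctan2(np.diff(y), np.diff(x)))
    end = n_seed - 1                      # trust the first n_seed points as the base line
    for i in range(n_seed - 1, len(angles)):
        mean_angle = angles[:i].mean()    # mean direction of the preliminary line so far
        if abs(angles[i] - mean_angle) > tol_deg:
            break                         # next step deviates too much: stop growing the line
        end = i + 1
    return end                            # index of the last point still considered on the line

x = np.arange(10, dtype=float)
y = np.array([0.0, 1.1, 2.0, 3.1, 4.0, 6.5, 9.0, 12.5, 16.0, 20.0])
end = find_deviation_point(x, y)
slope, intercept = np.polyfit(x[:end + 1], y[:end + 1], 1)   # line fit on the accepted points only
print(end, slope, intercept)

The number of seed points and the tolerance are the hyperparameters mentioned above.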

Daniel B.
  • Thanks for your comment, @Daniel B. The issue with the iterative search is the type of data I'm dealing with, which can consist of thousands of datapoints. I put the example here just as an illustration. I tried the iterative search and was not able to find the deviation point. – Ali Wali Jun 22 '20 at 22:32
  • What about some local search? Let's say you are always considering a (running) window of 100 (or 1,000) consecutive data points or so and split it into two parts. For each part, you compute the set of direction vectors from each data point to the next. The first time the comparison of these two sets of 49 directional vectors each differs significantly (given some statistical test), you take that as an indication of a transition point from one distribution to another. After all, these different 'lines' in the data may be assumed to stem from statistically significantly different distributions. – Daniel B. Jun 22 '20 at 22:44
  • Averaging over multiple consecutive directional vectors per partition of your running window and using a statistical significance test can then be seen as a means of filtering out noise. – Daniel B. Jun 22 '20 at 22:46
  • And to make working with vectors easier in statistical tests, you could transform them into rotation angles with respect to some global coordinate system that you "overlay" over your data. Then, you test two sets of rotation angles for their statistical difference. – Daniel B. Jun 22 '20 at 22:50
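A rough sketch of the windowed comparison described in the comments above, with a two-sample Kolmogorov–Smirnov test standing in for the unspecified statistical test; the function name, window size, and significance level are placeholder choices:

import numpy as np
from scipy.stats import ks_2samp

def first_transition(x, y, window=100, alpha=0.01):
    # Direction of each step, expressed as a rotation angle in a global frame
    angles = np.degrees(np.arctan2(np.diff(y), np.diff(x)))
    half = window // 2
    # Slide the window and compare the angle distributions of its two halves
    for start in range(len(angles) - window + 1):
        left = angles[start:start + half]
        right = angles[start + half:start + window]
        _, p = ks_2samp(left, right)
        if p < alpha:
            return start + half   # first window whose halves differ significantly
    return None                   # no transition found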

You are essentially looking for an elbow. The simplest way to do this is to fit the dataset with two lines, iterating the split point between them from one end to the other. You then choose the split with the highest average R² score (or lowest residual), and you have a best fit for both trendlines. Some code:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: the first points follow one line, the rest a steeper one
x = np.linspace(1, 10, 10).reshape((-1, 1))
y = np.append(np.linspace(1, 5, 5), np.linspace(6, 20, 5))

R = []
# Try every split that leaves at least two points on each side
for i in range(2, len(x) - 1):
    l1x, l1y = x[:i], y[:i]   # first segment
    l2x, l2y = x[i:], y[i:]   # second segment starts at the split index
    model1 = LinearRegression().fit(l1x, l1y)
    model2 = LinearRegression().fit(l2x, l2y)
    R.append((model1.score(l1x, l1y) + model2.score(l2x, l2y)) / 2)

best_split = np.argmax(R) + 2   # add the loop's starting offset back
print(best_split)               # index at which the second line begins
Nic Thibodeaux
  • Thanks Nic. But it's not exactly an elbow. What you see here is just a simplified version of my data. The points after the deviation point can take any shape and trend and cannot necessarily be fitted with a straight line. – Ali Wali Jun 22 '20 at 22:52
  • Gotcha. I think you could modify this, then, and only take the model1 R² value. You should expect a generally high R² until the deviation begins. To really be safe, you could begin the span with the first 5 points (or whatever you are comfortable with) and increase the span from there. The deviation point should be the index where the R² drops below a high threshold. – Nic Thibodeaux Jun 22 '20 at 23:06
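A sketch of that growing-span variant; the starting span of 5 points and the R² threshold of 0.99 are assumed values that would need tuning:

import numpy as np
from sklearn.linear_model import LinearRegression

def deviation_index(x, y, start=5, r2_threshold=0.99):
    # Grow the fitted span one point at a time and watch the R^2 of the fit
    for i in range(start, len(x) + 1):
        model = LinearRegression().fit(x[:i], y[:i])
        if model.score(x[:i], y[:i]) < r2_threshold:
            return i - 1   # length of the longest initial span that still fits a line well
    return len(x)          # the whole dataset fits a single line

x = np.linspace(1, 10, 10).reshape((-1, 1))
y = np.append(np.linspace(1, 5, 5), np.linspace(6, 20, 5))
print(deviation_index(x, y))   # number of points before the deviation begins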

I'd recommend you find a way to detect outliers (there are many methods) and then calculate the line of best fit ignoring the outliers.

Finding where the dataset deviates from the line of best fit is a difficult task, especially if a lot of your data ends up off the line, as in the picture.
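One possible realization of this (a sketch, not necessarily the method the answer has in mind) is a robust estimator such as RANSAC, which fits the line to the largest mutually consistent subset of points and flags the rest as outliers; the residual threshold below is an assumed tuning value:

import numpy as np
from sklearn.linear_model import RANSACRegressor

x = np.linspace(1, 10, 10).reshape((-1, 1))
y = np.append(np.linspace(1, 5, 5), np.linspace(6, 20, 5))

# Fit a line while ignoring points that do not agree with the consensus set
ransac = RANSACRegressor(residual_threshold=1.0, random_state=0)
ransac.fit(x, y)
print(ransac.inlier_mask_)                                     # True for points kept as part of the line
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)   # slope and intercept of the robust fit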

Leo Denham