
I have a discrete dataset, plotted below. I would like to find the point just before the curve drops steeply, which should be around index = 600. I have read this post and tried using the maximum absolute second derivative to find that point, but the result is wrong because of some 'bumpy' points, and I am not sure whether the point I am looking for is what is called an elbow point.

I am thinking about smoothing the curve. But before smoothing, is there any other approach I can try?

import matplotlib.pyplot as plt

plt.scatter(range(len(score)), score)
plt.axvline(x=600, linestyle='--')
plt.xlabel('Index')
plt.ylabel('Scores')
plt.show()

[Image: scatter plot of the scores against index, with a dashed vertical line at x = 600 marking the drop.]

sevendsds
  • Since you have tagged this with `python` can you give us some information on how you are storing your data and any libraries you are using currently? – MyNameIsCaleb Sep 25 '19 at 01:30
  • I attached the python code of plotting. Basically score is a list with length = 1000 – sevendsds Sep 25 '19 at 02:15
  • Finding the elbow is equivalent to making an assumption that the data are modeled by two different functions, one on either side of the elbow. The right way to go about this is to say what you think those functions might be (i.e. specify functional form with some free parameters) and look for the best fit to the data, letting the elbow point vary over the range of the data. Such models are sometimes called "change point" models. I know those are discussed in Seber & Wild, "Nonlinear Regression", probably many other books and papers too. – Robert Dodier Sep 25 '19 at 02:50
  • A quick approximation which might or might not be good enough is to assume the model is linear on either side of the elbow. That is, assume the data make a line with a kink in it, and select the elbow point as the location of the kink which minimizes total mean-square error. – Robert Dodier Sep 25 '19 at 02:52
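The kinked-line idea from the last comment can be sketched as a brute-force change-point search: fit a separate line to each side of every candidate split and keep the split with the lowest total squared error. This is a minimal illustration of that suggestion, not tested against the asker's data, and `find_kink` is a hypothetical helper name:

```python
import numpy as np

def find_kink(score):
    """Brute-force change-point search: fit one line to score[:k] and
    another to score[k:] for every candidate k, and return the k that
    minimizes the total sum of squared residuals."""
    score = np.asarray(score, dtype=float)
    x = np.arange(len(score))
    best_k, best_sse = None, np.inf
    for k in range(2, len(score) - 2):  # need >= 2 points per segment
        sse = 0.0
        for xs, ys in ((x[:k], score[:k]), (x[k:], score[k:])):
            coeffs = np.polyfit(xs, ys, 1)           # least-squares line
            resid = ys - np.polyval(coeffs, xs)
            sse += np.sum(resid ** 2)
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k
```

For 1000 points this is a few thousand small least-squares fits, which runs in well under a second, and it avoids the threshold problem entirely because the kink location is chosen by goodness of fit rather than by a preset slope.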

1 Answer


Perhaps this will work; it is based on approximating the derivative with a "central difference", i.e. for an index i, the derivative at that point is estimated from the points h behind and h ahead: (score[i+h] - score[i-h]) / (2*h).

def derivatives(score, h=100):
    """Estimate the first derivative at each index via a central difference."""
    score_derivatives = {}
    for i in range(h, len(score) - h):
        # Slope from the points h behind and h ahead of i.
        score_derivatives[i] = (score[i + h] - score[i - h]) / (2 * h)
    return score_derivatives

This returns a dictionary mapping each index to its central-difference derivative. You then loop through the dictionary and pick the first index whose value falls below a preset threshold (the gradient of your "steep drop").

If h is large enough (the default here is 100), this method will be fairly robust to noise.
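For instance, the threshold scan described above might look like this (the `derivatives` logic is repeated so the snippet runs on its own; `first_below` and the threshold value are illustrative, not part of the answer):

```python
def derivatives(score, h=100):
    # Central difference: slope at i from the points h behind and h ahead.
    score_derivatives = {}
    for i in range(h, len(score) - h):
        score_derivatives[i] = (score[i + h] - score[i - h]) / (2 * h)
    return score_derivatives

def first_below(score, threshold, h=100):
    # Return the first index whose central-difference slope drops below
    # the (dataset-specific) threshold, or None if no slope is that steep.
    for i, slope in sorted(derivatives(score, h).items()):
        if slope < threshold:
            return i
    return None
```

The threshold still has to be tuned per dataset, which is the limitation the asker raises in the comment below.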

  • Thanks! Your method is very straightforward. But I am looking for a more generalized one. If my dataset changes, the threshold should be reset depending on what the plot looks like. – sevendsds Sep 25 '19 at 05:01
  • In that case you will also want to approximate the second derivative via a differencing method; see https://en.wikipedia.org/wiki/Finite_difference#Higher-order_differences –  Sep 25 '19 at 06:15
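Following that last suggestion, a central second difference can rank indices by how sharply the curve bends, which removes the need for a threshold altogether. A minimal sketch under that assumption (helper names are hypothetical):

```python
def second_derivatives(score, h=100):
    # Central second difference: (f(i+h) - 2*f(i) + f(i-h)) / h**2.
    return {i: (score[i + h] - 2 * score[i] + score[i - h]) / h ** 2
            for i in range(h, len(score) - h)}

def max_curvature_index(score, h=100):
    # Threshold-free: return the index with the largest absolute second
    # derivative, i.e. where the curve bends the most.
    d2 = second_derivatives(score, h)
    return max(d2, key=lambda i: abs(d2[i]))
```

With a large h this averages over the 'bumpy' points that defeated the pointwise second-derivative approach the asker originally tried.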