
I have 350 document scores that, when I plot them, have this shape:

docScores = [(0, 68.62998962), (1, 60.21374512), (2, 54.72480392), 
             (3, 50.71389389), (4, 49.39723969), ...,  
             (345, 28.3756237), (346, 28.37126923), 
             (347, 28.36397934), (348, 28.35762787), (349, 28.34219933)]

I posted the complete array here on pastebin (it corresponds to the dataPoints list in the code below).

[Figure: score distribution]

I originally needed to find the elbow point of this L-shaped curve, which I found thanks to this post.

On the following plot, the red vector p represents the elbow point. I would like to find the point x = (?, ?) (the yellow star) on the vector b that is the orthogonal projection of p onto b.

[Figure: the L-curve with the vectors b and p, the expected projection x (yellow star), and the computed red point]

The red point on the plot is the one I obtain (which is obviously wrong). I obtain it by doing the following:

b_hat = b / np.linalg.norm(b)    #unit vector of b
proj_p_onto_b = p.dot(b_hat)*b_hat
red_point = proj_p_onto_b + s

If the projection of p onto b is defined by its starting and ending points, namely s and x (the yellow star), then proj_p_onto_b = x - s, and therefore x = proj_p_onto_b + s?

Did I make a mistake here?

EDIT: In answer to @cxw, here is the code for computing the elbow point:

import numpy as np
from operator import itemgetter

def findElbowPoint(self, rawDocScores):
    # list() so the pairs can be indexed on Python 3, where zip() is lazy
    dataPoints = list(zip(range(len(rawDocScores)), rawDocScores))
    s = np.array(dataPoints[0])     # first point of the curve
    l = np.array(dataPoints[-1])    # last point of the curve
    b_vect = l - s
    b_hat = b_vect / np.linalg.norm(b_vect)
    distances = []
    for scoreVec in dataPoints[1:]:
        p = np.array(scoreVec) - s
        proj = p.dot(b_hat) * b_hat
        d = np.linalg.norm(p - proj)  # orthogonal distance between b and the L-curve
        distances.append((scoreVec[0], scoreVec[1], proj, d))

    # the elbow is the curve point farthest from b
    elbow_x, elbow_y, proj, max_distance = max(distances, key=itemgetter(3))

    red_point = proj + s

EDIT: Here is the code for the plot:

>>> import matplotlib.pyplot as plt
>>> l_curve_x_values = [x[0] for x in docScores]
>>> l_curve_y_values = [x[1] for x in docScores]
>>> b_line_x_values = [x[0] for x in docScores]
>>> b_line_y_values = np.linspace(s[1], l[1], len(docScores))
>>> # +1 so that the p line ends at the elbow point itself, (elbow_x, elbow_y)
>>> p_line_x_values = l_curve_x_values[:elbow_x + 1]
>>> p_line_y_values = np.linspace(s[1], elbow_y, elbow_x + 1)
>>> plt.plot(l_curve_x_values, l_curve_y_values, b_line_x_values, b_line_y_values, p_line_x_values, p_line_y_values)
>>> red_point = proj + s
>>> plt.plot(red_point[0], red_point[1], 'ro')
>>> plt.show()
Floran Gmehlin

2 Answers


If you are using the plot to visually determine if the solution looks correct, you must plot the data using the same scale on each axis, i.e. use plt.axis('equal'). If the axes do not have equal scales, the angles between lines are distorted in the plot.
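
One way to separate a math error from a display artifact is to check orthogonality numerically before trusting the picture. The sketch below does that and then plots with equal axis scales. The coordinates are made up for illustration (they are not the asker's data), and `project_onto_line` and `plot_projection` are hypothetical helper names, not part of the question's code.

```python
import numpy as np


def project_onto_line(point, s, l):
    """Orthogonal projection of `point` onto the line through s and l."""
    point, s, l = (np.asarray(v, dtype=float) for v in (point, s, l))
    b = l - s
    b_hat = b / np.linalg.norm(b)
    p = point - s                      # vector from s to the point
    return s + p.dot(b_hat) * b_hat    # x = s + (projection of p onto b)


def plot_projection(point, s, l):
    """Draw b, the point and its projection with equal axis scales."""
    import matplotlib.pyplot as plt    # imported here so the numeric check needs only NumPy

    x = project_onto_line(point, s, l)
    plt.plot([s[0], l[0]], [s[1], l[1]], label="b")
    plt.plot([point[0], x[0]], [point[1], x[1]], "r--", label="point to projection")
    plt.plot(x[0], x[1], "ro")
    plt.axis("equal")                  # same scale on both axes, so right angles look right
    plt.legend()
    plt.show()


# Made-up coordinates, roughly the magnitudes seen in the question
s = np.array([0.0, 68.6])
l = np.array([349.0, 28.3])
elbow = np.array([50.0, 37.0])

x = project_onto_line(elbow, s, l)
residual = elbow - x
print(np.dot(residual, l - s))         # numerically zero if x is the orthogonal projection
```

If the printed dot product is numerically zero, the projection is correct and any skew in the figure comes from unequal axis scales; calling `plot_projection(elbow, s, l)` should then show the dashed segment meeting b at a right angle.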

Warren Weckesser
  • Thank you for this, I have been trying to find the projection for 1 hour. I thought the equation is wrong, I checked it more than 10 times. Eventually, the problem was with the matplotlib, and your answer helped me to figure this out. Thanks! – Ahmed Tarawneh May 11 '21 at 22:15

First of all, is the point at ~(50, 37) p or s+p? If p, that might be your problem right there! If the Y component of your p variable is positive, you won't get the results you expect when you do the dot product.

Assuming that point is s+p, if a bit of Post-It scribbling is correct,

p_len = np.linalg.norm(p)
p_hat = p / p_len
red_len = p_hat.dot(b_hat) * p_len   # red_len = |x-s|
    # because p_hat . b_hat = 1 * 1 * cos(angle) = |x-s| / |p|
red_point = s + red_len * b_hat

Not tested! YMMV. Hope this helps.
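
A quick check, sketched here with made-up vectors rather than the asker's data, suggests why this formula may give the same red point as the code in the question: since p_hat * p_len is p itself, p_hat.dot(b_hat) * p_len equals p.dot(b_hat).

```python
import numpy as np

# Made-up vectors for illustration; not the asker's data
s = np.array([0.0, 60.0])
b = np.array([349.0, -32.0])
p = np.array([50.0, 37.0]) - s         # vector from s to the elbow point

b_hat = b / np.linalg.norm(b)

# Formula from the question
red_point_question = s + p.dot(b_hat) * b_hat

# Formula from this answer
p_len = np.linalg.norm(p)
p_hat = p / p_len
red_point_answer = s + p_hat.dot(b_hat) * p_len * b_hat

# Both expressions describe the same point, up to floating-point rounding
print(np.allclose(red_point_question, red_point_answer))
```

So if the red point still looks wrong with either formula, the cause is likely elsewhere, for example in how the plot is drawn.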

cxw
  • Hi! Thanks for your answer. I think the problem indeed lies with `p`, as I used the POINT `p` to make the dot product instead of the VECTOR `p` (which is `s+p`)! I will try your suggestion and get back to you when I get the result :) – Floran Gmehlin Oct 06 '16 at 12:16
  • Aaah, ignore my last comment: the vector `p` is obtained by computing `p(50, 37) - s(0, 60)`. Is that not correct? – Floran Gmehlin Oct 06 '16 at 12:20
  • Using your code, I get the same `red_point` that I got previously. Weirdly, using `p` or `p+s` both give me the same red point :( – Floran Gmehlin Oct 06 '16 at 12:24
  • I edited the question with the code, if you need the full data, let me know and I will pastebin it – Floran Gmehlin Oct 06 '16 at 12:33
  • @FloranGmehlin Thank you again! Since the red point is the one that appears at the wrong location, would you please also add the code for plotting the red point? I have looked at the posted code in detail and cannot find an error (which is both good and bad news :) ). Much appreciated! – cxw Oct 06 '16 at 13:18