4

So here's an odd question. I'm working on a kNN problem and need to find the nearest neighbor. I'm looking into distance, but once again, I don't care about the actual distance, just which one is closest. However, since distance can't be negative, I need to either square or take the absolute value of the distance.

So here are two options for how to accomplish this:

//note: it's been abstracted for multiple dimensions (not just x and y)
for(int i = 0; i < (numAttributes - 1); i++)
{
    distance += Math.pow((a.value(i) - b.value(i)), 2);
}

and

//note: it's been abstracted for multiple dimensions (not just x and y)
for(int i = 0; i < (numAttributes - 1); i++)
{
    distance += Math.abs(a.value(i) - b.value(i));
}

My question is which is faster. Since this is a data mining application, I want it to be able to process the information as quickly as possible. And while I understand, that in the guts, a power of two can be implemented with a shift, I'm not sure this is the case in such a high level language like Java where it gets translated for the JVM. Is there a reason why one is better than the other?

  • 2
    Which is faster? How about you time them both and find out? Or compile the two lines and see which one generated more instructions. – Tdorno Apr 30 '15 at 00:35
  • 1
    doesn't using the L1 norm vs the L2 norm have statistical implications? – Transcendence Apr 30 '15 at 00:37
  • After timing it, I came out with the absolute value as faster, but this seems to contradict what I thought would be the case with the shift capability in low level assembly. I would have thought that the JVM would have caught that. Is there a reason why it's so much slower to find the power of 2? Is there a reason a shift would be so slow in Java? After all, it seems to be only half as fast in C++ (compared to three times as fast in Java). – Mackenzie Bodily Apr 30 '15 at 01:02
  • 3
    What? Math.pow(..., 2) isn't anything to do with shifting, and doesn't compute powers of two (which would be Math.pow(2, ...)) – user253751 Apr 30 '15 at 01:31
  • 2
    Note that *you will get different results* with both of these. If you're implementing an algorithm that requires squaring, then using abs isn't just faster, it's *wrong* (although depending on lots of things, it might be a good enough approximation). – user253751 Apr 30 '15 at 01:32
  • 3
    **1.** You cannot compute a _square_ with a single shift. You can compute _an integer power of 2_ with a shift, but that's of no use here. **2.** I expect `Math.abs(d)` to be _much_ faster than `Math.pow(d, 2)`. Why? Because `abs` should map to a single AND that clears the sign bit, while `pow(, 2)` maps at best to a multiplication and at worst to an actual function call to `pow`. **3.** immibis is right: L1 and L2 norms have very different properties; In particular L2 is differentiable everywhere, while L1 is not (the spike at 0), which leads to a whole host of problems in some applications. – Iwillnotexist Idonotexist Apr 30 '15 at 01:43
  • @IwillnotexistIdonotexist "abs should map to a single AND" - HOW? – apangin Apr 30 '15 at 08:21
  • 1
    @apangin If you read up on the IEEE-754 floating-point number representation you'll learn it's essentially a _sign_-and-(_magnitude_-times-2-to-the-power-of-_exponent_) representation. For that reason, clearing the sign bit is sufficient to get the absolute value. – Iwillnotexist Idonotexist Apr 30 '15 at 10:09

1 Answers1

5

First, consider vectors A=[0,0,0], B=[1,1,1], C=[0,0,2]. Which one is closer to A? Is it B or C? Actually, caring about the distance measure is absolutely crucial in kNN. And we are only talking about Manhattan and euclidean distances. You could, by example, use the cosine similarity as well, and you should select the distance measure carefully, taking the into account the knowledge you have about your data.

Second, instead of such a low-level optimization, consider something smarter. Such as breaking your for(int i = 0; i < (numAttributes - 1); i++) loop as soon as too large distance is detected.

Third, using Math.pow(a,2) to compute a*a is definitely very inefficient.

Fourth, i < (numAttributes - 1)? Didn't you mean i < numAttributes??

Tregoreg
  • 18,872
  • 15
  • 48
  • 69
  • The last attribute is non numeric and contains the correct answer (I guess you could say), so if I count that I get a NAN exception. But, looking at it I should definitely cut that out of the loop as well as get a break statement in there to shorten it. Thank you! – Mackenzie Bodily Apr 30 '15 at 01:43
  • "using Math.pow(a,2) to compute a*a is definitely very inefficient". That's not true. Look [here](http://stackoverflow.com/questions/29144275/xx-vs-math-powx-2-java-performance) – apangin Apr 30 '15 at 08:23
  • @apangin Oh right, very interesting, didn't expect the compiler to do the work :) – Tregoreg Apr 30 '15 at 14:06