6

I'm working on an iPhone app that performs certain physics calculations thousands of times per second, and I am optimizing the code to improve the framerate. One piece I am looking at is the inverse square root. Right now I am using the Quake 3 fast inverse square root method. After some research, however, I read that the NEON instruction set offers a faster way. I am unfamiliar with inline assembly and cannot figure out how to use NEON. I tried integrating the math-neon library, but I get compiler errors because most of its NEON-based functions lack a return statement.

EDIT: I've suddenly been getting some "unclear question" close votes. Although I think it's quite clear and those who answered obviously understood, maybe some people need it stated explicitly: How do you use Neon to perform faster calculations? And is it really the fastest method for getting the inverse square root on the iPhone?

EDIT: I did some more formal testing on Neon vs. Quake today, but if anything, I'm even more uncertain about the outcome now:

  • In-App Testing: (An app that is currently in the app store with its invsqrt method modified)

    1. Quake Method (leading by a marginal increase in average FPS under stressful conditions)
    2. Neon (It was a really close call but it seemed that Quake was slightly faster)
    3. 1/sqrtf() (a bit more noticeable difference, 1-3 FPS drop).
  • "Formal" Testing (An app that devours my phone's CPU. It times how long each method takes to get through an array of 10,000,000 randomly generated floats)

    1. Neon (clearly the fastest, and double the speed if it is used to do two sqrts at once).
    2. 1/sqrtf() (Only marginally slower than Neon. This surprising result leads me to deem this test "inconclusive" until I investigate further)
    3. Quake (This method, surprisingly, was a few orders of magnitude slower than the other two methods. This is especially surprising given its performance in the other test.)

While Quake vs. Neon was too close to call in the app performance test, the difference between Quake and 1/sqrtf() was clear-cut in that test, and the second test produced extremely consistent timings. What matters in the end, though, is app performance, so I'm going to base my final decision on the in-app test.

WolfLink
  • Have you run your app under instruments to see where the time is actually being spent? It's exceedingly unlikely that you're spending a significant portion of CPU time doing "thousands" of (inverse) square roots per second, unless you mean something like "three hundred thousand thousands". – Stephen Canon Jan 12 '14 at 12:55
  • "Stressful Conditions" in my app is running the calculation about 122500 times per second. In my goal scenario, it will run 594000 times per second. Changing the square root method has had a noticeable effect, but there are other bottlenecks that I am working on as well. – WolfLink Jan 12 '14 at 19:13

2 Answers

5

The accepted answer of the question you've linked already provides the answer, but doesn't spell it out:

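#import <arm_neon.h>

float foo(float someFloat) {
    // vrsqrte_f32 takes a two-lane vector, so broadcast the scalar into both lanes.
    float32x2_t inverseSqrt = vrsqrte_f32(vdup_n_f32(someFloat));
    // Lane 0 holds the rough estimate of 1/sqrt(someFloat).
    return vget_lane_f32(inverseSqrt, 0);
}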

Header and function are already provided by the iOS SDK.

DarkDust
  • How do I convert between `float32x2_t` and `float`? I can't find any good documentation on what `float32x2_t` is exactly. – WolfLink Jan 10 '14 at 08:03
  • It's actually a `float32_t` which is a `float` (in Xcode, hold the command key and click on a type to jump to its definition). I've edited my answer accordingly. – DarkDust Jan 10 '14 at 08:05
  • Compiler error: "Passing 'float' to parameter of incompatible type float32x2_t" – WolfLink Jan 10 '14 at 08:11
  • Found answer: `float32x2_t vectorFloat = {someFloat, 1.0f};` and `float outputFloat = vectorFloat[0];` So it could be used to do two floats simultaneously, although that isn't useful to me with my current code structure. – WolfLink Jan 10 '14 at 08:23
  • @WolfLink Is it really faster? Did you do any benchmark test for it or is it just faster to the eye? Thanks in advance. – Unheilig Jan 11 '14 at 23:26
  • @Unheilig It doesn't seem to be any faster than the Quake method I was using, but I haven't done any formal tests on it yet. I compared the framerate of my app under stressful conditions between quake and neon invsqrt methods, and if there was any difference, quake was actually faster. However, neon can do two calculations at once, so I'm trying to rework my code to make use of that. I think I will get a speed boost then, I'll edit my question with the new information when I'm done with that. – WolfLink Jan 11 '14 at 23:36
  • @WolfLink Thanks for reply. Look forward to your update when you're done. I already upvoted you (yup, that vote was from me :-). If I could, I would do it again. – Unheilig Jan 11 '14 at 23:40
  • @Unheilig so Quake seemed best when used in my app, then Neon, and then 1/sqrtf, but when I did a more controlled experiment, the order was neon>1/sqrtf>quake, and the difference between quake and the other two was huge. (2 seconds vs 2 milliseconds) – WolfLink Jan 12 '14 at 10:11
2

There's a NEON implementation of invsqrt at https://code.google.com/p/math-neon/source/browse/trunk/math_sqrtf.c and you should be able to copy the assembly bit as-is.

Fjölnir
  • I am new to inline assembly. How do I get output from that and how do I give it input? – WolfLink Jan 10 '14 at 07:54
  • From what I can tell it just reads the parameter from the first param register. But you should take a look at the function DarkDust mentioned: vrsqrte_f32 – Fjölnir Jan 10 '14 at 07:56
  • @WolfLink: The function linked by @fyolnish seems to be the better/more precise implementation (as far as I've understood some comments in the other question, the `vrsqrte_f32` alone is not enough for a precise result). You _could_ simply copy the whole function `sqrtf_neon_hfp` (I don't understand what the wrapper `sqrtf_neon_sfp` is doing). It already does all you'd need. However, the problem is that the file is licensed under LGPL3 which is incompatible with the iOS AppStore, so you'd be violating the LGPL3 by copying the function... – DarkDust Jan 10 '14 at 08:01
  • Is `vrsqrte_f32` just a fast iteration of Newton's method? – WolfLink Jan 10 '14 at 08:16
  • How should I make the result from vrsqrte_f32 more precise? – WolfLink Jan 10 '14 at 08:21
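On the precision question, one common approach is to follow the `vrsqrte_f32` estimate with Newton-Raphson steps using `vrsqrts_f32`, which computes `(3 - a*b) / 2`. The sketch below assumes two refinement steps are enough for the use case. The function name `rsqrt_refined` is hypothetical, and the non-NEON branch is an assumption that mirrors the same formula in scalar code, starting from a bit-trick guess.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

static float rsqrt_refined(float x) {
    float32x2_t v = vdup_n_f32(x);    /* broadcast x into both lanes */
    float32x2_t y = vrsqrte_f32(v);   /* rough initial estimate of 1/sqrt(x) */
    /* Each step: y = y * (3 - (x*y)*y) / 2, with vrsqrts_f32 doing (3 - a*b) / 2. */
    y = vmul_f32(y, vrsqrts_f32(vmul_f32(v, y), y));
    y = vmul_f32(y, vrsqrts_f32(vmul_f32(v, y), y));
    return vget_lane_f32(y, 0);
}
#else
/* Scalar stand-in for targets without NEON: same Newton-Raphson formula. */
static float rsqrt_refined(float x) {
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);        /* rough initial estimate */
    memcpy(&y, &bits, sizeof y);
    y = y * ((3.0f - (x * y) * y) * 0.5f);   /* first refinement step */
    y = y * ((3.0f - (x * y) * y) * 0.5f);   /* second refinement step */
    return y;
}
#endif
```

Each step roughly doubles the number of correct bits, so the number of steps is a trade-off between accuracy and speed. Dropping the second step would be a reasonable experiment if the extra precision is not needed.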