What is the relevance of the Stack Overflow question "Why does changing 0.1f to 0 slow down performance by 10x?" for Objective-C? If it is relevant, how should it change my coding habits? Is there some way to turn off denormalized floating-point numbers on Mac OS X?

It seems like this is completely irrelevant to iOS. Is that correct?

Dan Rosenstark

1 Answer

As I said in response to your comment there:

it is more of a CPU than a language issue, so it probably has relevance for Objective-C on x86. (iPhone's ARMv7 doesn't seem to support denormalized floats, at least with the default runtime/build settings)

Update

I just tested. On Mac OS X on x86 the slowdown is observed; on iOS on ARMv7 it is not (with default build settings).

And, as expected, denormalized floats appear again when running in the iOS Simulator (on x86).

Interestingly, FLT_MIN and DBL_MIN are defined as the smallest non-denormalized float and double, respectively (on iOS, Mac OS X, and Linux). Strange things happen using

DBL_MIN/2.0

in your code; the compiler happily emits a denormalized constant, but as soon as the (ARM) CPU touches it, it is set to zero:

double test = DBL_MIN/2.0;
printf("test      == 0.0 %d\n",test==0.0);
printf("DBL_MIN/2 == 0.0 %d\n",DBL_MIN/2.0==0.0);

Outputs:

test      == 0.0 1  // computer says YES
DBL_MIN/2 == 0.0 0  // compiler says NO

So a quick runtime check for whether denormalized numbers are supported could be:

#define SUPPORT_DENORMALIZATION ({volatile double t=DBL_MIN/2.0;t!=0.0;})

("given without even the implied warranty of fitness for any purpose")
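The macro above relies on the GNU statement-expression extension `({ ... })`. A version in plain C99 might look like the sketch below; the function name `supports_denormals` is invented for illustration, and the same lack of warranty applies. Both operands are `volatile` so that the division happens at run time on the CPU instead of being folded by the compiler.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if halving DBL_MIN at run time gives a
 * nonzero (denormalized) result, i.e. the FPU is not flushing to zero. */
static bool supports_denormals(void)
{
    volatile double smallest_normal = DBL_MIN;    /* smallest normalized double */
    volatile double half = smallest_normal / 2.0; /* computed by the CPU, not the compiler */
    return half != 0.0;
}
```

The result reflects the floating-point mode of the calling thread at the time of the call, so it can change if flush-to-zero is switched on or off later.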

This is what ARM has to say on flush to zero mode: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204h/Bcfheche.html

Update<<1

This is how you disable flush to zero mode on ARMv7:

unsigned int x;
asm volatile(
    "vmrs %[result],FPSCR \n\t"                /* read FPSCR */
    "bic %[result],%[result],#16777216 \n\t"   /* clear bit 24 (FZ, flush-to-zero) */
    "vmsr FPSCR,%[result]"                     /* write it back */
    : [result] "=r" (x)
);
printf("ARM FPSCR: %08x\n", x);

with the following surprising result.

  • Column 1: a float, divided by 2 for every iteration
  • Column 2: the binary representation of this float
  • Column 3: the time taken to sum this float 1e7 times

You can clearly see that denormalized numbers come at zero cost on an iPad 2. (On an iPhone 4, they come at a small cost: a 10% slowdown.)

0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 110 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 110 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 110 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 110 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 111 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 110 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 110 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 110 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 110 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 110 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 110 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 110 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 110 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 111 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 110 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 110 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 110 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 110 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 112 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 110 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 110 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 110 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 111 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 110 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 110 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 110 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 110 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 110 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 110 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 111 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 111 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 110 ms
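A measurement of this kind can be sketched roughly as follows. This is not the exact program that produced the numbers above: the function names are invented for illustration, `clock()` measures CPU time with limited resolution, and the binary-representation column is left out.

```c
#include <stdio.h>
#include <time.h>

/* Adds f to an accumulator n times and returns the elapsed CPU time in ms.
 * The accumulator is volatile and is handed back through sum_out so the
 * loop cannot be optimized away. */
static double time_sum_ms(float f, long n, float *sum_out)
{
    volatile float sum = 0.0f;
    clock_t start = clock();
    for (long i = 0; i < n; i++)
        sum += f;
    clock_t end = clock();
    *sum_out = sum;
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}

/* Halves f every round and prints the value and the time taken to sum it. */
static void run_halving_benchmark(float start, int rounds, long n)
{
    float f = start;
    for (int r = 0; r < rounds; r++) {
        float sum;
        double ms = time_sum_ms(f, n, &sum);
        printf("%.48f: %.0f ms (sum=%g)\n", f, ms, sum);
        f /= 2.0f;
    }
}
```

Something like `run_halving_benchmark(1e-34f, 40, 10000000L)` would walk from normalized values down through the denormalized range, in the spirit of the table above.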
mvds
  • I'm curious though, does arm7 just flush to zero (always)? – Mysticial Feb 19 '12 at 16:16
  • So how can you avoid the slowdown on OS X? – Dan Rosenstark Feb 19 '12 at 17:19
  • @Yar: I would say: compile with `-ffast-math`, but no matter what flags I set, it refuses to flush to zero. – mvds Feb 19 '12 at 17:29
  • @Mysticial: looking at the docs, it always does so for NEON instructions. So it must be possible to get denormals still. – mvds Feb 19 '12 at 17:30
  • @Mysticial: see my updated answer, you can disable flush to zero using inline assembly. there may be a compile flag for it as well. – mvds Feb 19 '12 at 18:24
  • My last question -- I promise I won't turn this into a separate question ;) -- does any of this matter in VM-based languages like Java? I'd guess not... – Dan Rosenstark Feb 20 '12 at 00:45
  • @Yar: I have been thinking about that one as well... how about *you* spending some time to find out ;-) – mvds Feb 20 '12 at 00:46
  • Just to make sure I'm in the right ballpark, in Java I would be checking if there's a slowdown dealing with numbers between `Float.MIN_NORMAL` and `Float.MIN_VALUE` compared with numbers between `Float.MIN_NORMAL` and, say, 1.0f? Correct me if I'm totally lost, please. – Dan Rosenstark Feb 20 '12 at 01:54
  • @Yar: not sure, just start at some small value and divide by 2 for every round. Then, if you find some interesting threshold, you can compare against predefined constants. – mvds Feb 20 '12 at 01:58
  • I'm not seeing any interesting differences. Over 1e9 runs after a warmup, dividing .5f by 2.0f is almost the same as dividing `MIN_NORMAL` or even `MIN_VALUE`. I wonder if I should add some randomness to make sure the JVM doesn't cheat. Or if I should cheat and put this question out to the world tomorrow during SO primetime, along with my crappy code sample. – Dan Rosenstark Feb 20 '12 at 02:29
  • @Yar: "In particular, the Java programming language requires support of IEEE 754 denormalized floating-point numbers and gradual underflow" (source: http://java.sun.com/docs/books/jls/second_edition/html/typesValues.doc.html) – mvds Feb 20 '12 at 02:39
  • @Yar: For java I get the exact same result. 50x slowdown once you get below `Float.MIN_NORMAL (~1E-38)` for `float`. – mvds Feb 20 '12 at 02:51
  • @Yar: and of course it depends on the CPU; previous comment was about x86, if you run on ARM (dalvik VM, Android) you can get down to the denormalized `1E-45` without slowdown. – mvds Feb 20 '12 at 03:01
  • Alright, I'll have to (understand and) port your code from the original question and give it a whirl. Not sure why I didn't see any slowdown at all, but my test code was very limited. Thanks for all the help, it's been inspirational. – Dan Rosenstark Feb 20 '12 at 03:54
  • Oh I get it now. Here's my port (with improvements?): http://pastebin.com/2ZDvdCDv. So just by avoiding numbers between MIN_NORMAL and MIN_VALUE you can speed up your code by a factor of 20+, even in Java. Next I'll have to try Ruby :) – Dan Rosenstark Feb 21 '12 at 17:15
  • @Yar: nice job. But you really don't need that particular algorithm. You can get a clean measurement by simply timing `for(i=0;i<10000000;i++)sum+=f;`, with ever smaller values of `f`. Be sure to do something with `sum` because otherwise it is optimized away. – mvds Feb 21 '12 at 22:35
  • @Yar: The big question for java is, given a CPU with no support for denormalized floats (if any) how does the java vm solve this without a severe slowdown? – mvds Feb 23 '12 at 09:07
  • @mvds I really thought the VM was a thicker abstraction layer, so this has opened my eyes a LOT. So, to sum up: in all the cases where the slowdown takes longer to appear (like ARM), the CPU has built-in support for a more precise floating point arithmetic? Also: if your language doesn't have a way to shut it off, then you have to take the hit, because it's no doubt faster than wrapping all of your floats in a float wrapper that doesn't allow denormal values... right? – Dan Rosenstark Feb 23 '12 at 15:47
  • @Yar yeah, sounds like it. I think the fact that you "feel" this through the VM is a good thing. (ps. the slowdown doesn't appear *at all* on ARM) Does java have a way to shut it off? – mvds Feb 23 '12 at 22:38
  • @mvds looks like there is no way to shut it off, except to avoid those numbers :) – Dan Rosenstark Mar 12 '12 at 16:55
  • great, but what is the code to re-enable flush-to-zero afterwards? In all related questions, there are several techniques presented to control processor denormal number behavior - but no one says anything about the scope of this setting. In this example, you set the processor behavior in runtime - so this changes also other threads calculations? only my process? other processes? I wonder... no one speaks of the scope. – Motti Shneor Dec 28 '15 at 23:03
  • @MottiShneor re-enabling FTZ is easiest by changing `BIC` (Bit Clear) to `ORR` (OR) in the same code snippet. I'm no expert, but the scope is typically local I guess, no sane OS would allow changes in control registers to propagate from process to process. Whether the setting is reset e.g. when returning from a function depends on the calling convention on the platform you're targeting. A quick search turned up http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf, specifically see 5.1.2.1 that mentions the FPSCR register in relation to the calling convention. – mvds Dec 29 '15 at 02:33
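The `BIC` to `ORR` change described in the last comment might look like the sketch below. The function name `arm_enable_flush_to_zero` is made up for illustration, the code assumes a 32-bit ARM target with VFP enabled, and it has not been run on a device; on any other target it compiles to a no-op that returns 0.

```c
/* Hypothetical helper: sets bit 24 (FZ, flush-to-zero) in the ARMv7 FPSCR
 * and returns the new register value. Returns 0 on non-ARM targets. */
static unsigned int arm_enable_flush_to_zero(void)
{
    unsigned int fpscr = 0;
#if defined(__arm__)
    __asm__ volatile(
        "vmrs %[result], FPSCR \n\t"                 /* read FPSCR */
        "orr  %[result], %[result], #16777216 \n\t"  /* set bit 24 (1 << 24) */
        "vmsr FPSCR, %[result]"                      /* write it back */
        : [result] "=r" (fpscr));
#endif
    return fpscr;
}
```

Printing the return value with `%08x`, as in the answer's snippet, shows whether bit 24 is set after the call.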