There is a comparison:

if( val0 > val1 )

where val0 and val1 are double variables.

The code generated by the Apple LLVM compiler is

+0x184  vcmpe.f64                      d17, d16
+0x188  vmrs                           APSR_nzcv, fpscr <-- FP status transfer (30 cycles stall of ALU)
+0x18c  ble.w                          .....

Is there any way to avoid this kind of transfer?

[UPDATE] The code is running on the Cortex-A8 processor.

Alex
  • How about comparing them as *sign and magnitude* integers? It's [possible with IEEE-754](http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers). – Alexey Frunze Dec 29 '11 at 09:17
  • @Alex, maybe you could first explain why you need this? The comparison via parts is definitely slower and longer. – Anton Korobeynikov Dec 29 '11 at 12:52
  • Where did you find the 30 cycle figure? – Stephen Canon Dec 29 '11 at 13:37
  • I doubt it is the flag transfer. It is more likely that you are waiting for the vcmpe to complete before the vmrs instruction can execute, so the vmrs is stalled. If you have other things you want to do in the meantime, do them, then perform the vmrs and ble after. – old_timer Dec 29 '11 at 14:40
  • @Anton Korobeynikov: this is a frequently called section of the code and because of it I wanted to find a way to reduce the stall. – Alex Dec 29 '11 at 16:26
  • @Stephen Canon: got this number in the profiler – Alex Dec 29 '11 at 16:28
  • @dwelch: Actually I get the same stall in the case of float->int conversion, thus I suppose that it is caused by the register transfer from co-processor to the processor. – Alex Dec 29 '11 at 16:31
  • @Alex my understanding is that the FPU operates in parallel. Whenever you need to synchronize with the CPU, be it a flag copy or a float-to-int conversion into a CPU register, the CPU will stall if the FPU has not completed the math. Take the code you provided and time it very accurately. Put, say, a dozen (CPU) nops in between the vcmpe and the vmrs, and time it again very accurately. If it takes the same amount of time in both cases, the stall is in the vmrs. If it is longer by a dozen plus the fetch cycles, then the stall is in the vcmpe. (Unless your time measurement is bad, etc., which is why I say VERY accurately.) – old_timer Dec 29 '11 at 19:56
  • Turn the caches off, execute from RAM, etc. to make that measurement. Being double, I assume this is a Cortex-A of some sort? I have a Cortex-M with an FPU, but it is single precision; if it is a similar or the same FPU, I could try something like this if you wish. – old_timer Dec 29 '11 at 19:57
  • @Alex, according to the Cortex-A9 docs, vcmpe.f64 has the biggest latency of 5 (1 instruction latency and 4 is the result writeback), vmrs is free, it does not have latency at all. However, vcmpe.f64 is a VFP instruction, and the VFP unit is not pipelined on A8/A9 at all and shares resources with NEON. So, if you have some NEON code before the vcmpe, then the stall might be caused by a VFP-NEON domain change. – Anton Korobeynikov Dec 30 '11 at 09:39
  • [Difference between Cortex-A8 and Cortex-A9](http://forums.arm.com/index.php?/topic/14277-difference-between-cortex-a8-and-cortex-a9/) – Alex Jan 01 '12 at 10:51
  • @AntonKorobeynikov I'm sorry, I didn't mention that this code runs on a Cortex-A8 processor. If you can, please provide me a link to the document that you mentioned above. Thanks. – Alex Jan 26 '12 at 07:20

1 Answer

It seems impossible to avoid the flag transfer: control flow is handled by the ARM core, not the NEON coprocessor, so the result of the comparison has to reach the ARM flags before the branch can be taken. Question is closed.
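For reference, the integer comparison suggested in the comments could look roughly like the sketch below. It assumes IEEE-754 binary64 doubles and that neither value is NaN; the names `double_order_key` and `double_greater` are made up for illustration. It is not a guaranteed win: as noted in the comments it is longer than a plain compare, and it probably only helps when the values can be loaded from memory as integers, since moving them out of the VFP registers would reintroduce the transfer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map the bit pattern of a double to a signed 64-bit
 * key whose integer ordering matches the numeric ordering of the doubles.
 * Assumes IEEE-754 binary64 and that d is not NaN. */
static int64_t double_order_key(double d)
{
    int64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* reinterpret the bits without aliasing issues */

    /* Doubles are sign-and-magnitude. For a negative value the bits are
     * INT64_MIN + magnitude, so INT64_MIN - bits gives -magnitude.
     * This also maps -0.0 to the same key as +0.0. */
    if (bits < 0)
        bits = INT64_MIN - bits;
    return bits;
}

/* Hypothetical replacement for `val0 > val1` using only integer compares. */
static int double_greater(double val0, double val1)
{
    return double_order_key(val0) > double_order_key(val1);
}
```

With this, `if( val0 > val1 )` would become `if( double_greater(val0, val1) )`. Unlike `vcmpe`, the sketch gives no meaningful result for NaN operands, so it should be measured and checked against the original before being relied on.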

Alex