Performance of sqrt function on AArch64

Question

I'm measuring the performance of the sqrt function on AArch64 for academic reasons. Code for the single-precision sqrtf function:

fsqrt s0, s0 
ret

Code for the double-precision sqrt function:

fsqrt d0, d0 
ret

I'm referring to theoretical latencies for FSQRT from here: http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf

According to that guide, single-precision sqrt looks about 2x faster than double-precision.

But when profiling, I get these numbers:

326 ms  sqrt
 82 ms  sqrtf

Both timings are for the same number of loop iterations. From those numbers, sqrtf seems 4x faster.

I can't find a proper reason for this, and I haven't found a good explanation online of how this instruction actually works.

Some info or direction on this would be really useful.

Remember to change the alignment of the instruction and measure again, repeat as needed. If you are trying to measure a single instance of the instruction, it is unlikely you will be successful. — old_timer, Jan 23 '17 at 13:50
@old_timer I'm measuring the performance for a million function calls in a loop. That should not be the problem. — Vikram Dattu, Jan 24 '17 at 08:50

Kyrill · Accepted Answer · 2018-08-28T08:55:47.793

If you look at the note attached to the table entries for the FSQRT instruction in the Cortex-A57 optimization guide, it says that the "FP divide and square root operations are performed using an iterative algorithm".

That means the latency varies depending on the input to the instruction. That is the meaning of the "7-17" and "7-32" latency numbers in the table. Depending on the input, the single-precision FSQRT can take between 7 and 17 cycles to complete, whereas the double-precision variant can take between 7 and 32 cycles.

So if a particular single-precision computation happens to take 7 cycles but a double-precision computation takes, say, 28 cycles, you have a 4x disparity.
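
To make that arithmetic concrete, here is a minimal C sketch. It assumes the 7-17 and 7-32 cycle ranges quoted above are the only factor, so it ignores call, return and loop overhead, which a real measurement would include. The struct and function names are made up for illustration and do not come from the guide.

```c
#include <assert.h>
#include <stdio.h>

/* Latency range in cycles, as quoted from the table in the guide. */
struct latency_range {
    int min_cycles;
    int max_cycles;
};

/* Smallest possible double/single ratio:
   fastest double-precision case over slowest single-precision case. */
static double min_ratio(struct latency_range single, struct latency_range dbl)
{
    assert(single.max_cycles > 0);
    return (double)dbl.min_cycles / single.max_cycles;
}

/* Largest possible double/single ratio:
   slowest double-precision case over fastest single-precision case. */
static double max_ratio(struct latency_range single, struct latency_range dbl)
{
    assert(single.min_cycles > 0);
    return (double)dbl.max_cycles / single.min_cycles;
}

/* Hypothetical helper: print the bounds next to a measured ratio. */
static void report(double measured_double_ms, double measured_single_ms)
{
    const struct latency_range fsqrt_s = {7, 17}; /* single precision */
    const struct latency_range fsqrt_d = {7, 32}; /* double precision */

    printf("possible ratio: %.2fx to %.2fx\n",
           min_ratio(fsqrt_s, fsqrt_d), max_ratio(fsqrt_s, fsqrt_d));
    printf("measured ratio: %.2fx\n",
           measured_double_ms / measured_single_ms);
}
```

Calling report(326.0, 82.0) from main with the timings in the question prints a possible range of 0.41x to 4.57x and a measured ratio of 3.98x, so the observed 4x gap sits inside what the table allows.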
