I'm measuring the performance of the sqrt function on AArch64 for academic reasons. Code for the single-precision sqrtf function:
fsqrt s0, s0
ret
Code for the double-precision sqrt function:
fsqrt d0, d0
ret
I'm referring to the theoretical latencies for FSQRT given here: http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf
According to that guide, single-precision sqrt seems to be about 2x faster than double-precision.
But when I profile the two functions, I get these numbers:
326 ms sqrt
82 ms sqrtf
Both timings were taken over the same number of iterations. From those numbers, sqrtf seems to be about 4x faster.
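To make the comparison concrete, here is a minimal sketch of how such a measurement and ratio could be computed. It is not the exact harness used for the numbers above: the names time_unary_ms, nop_d and speedup are hypothetical, and it assumes clock() from standard C is precise enough for runs of this length. Each call feeds the next one, so the loop is bound by the latency of the function rather than its throughput.

```c
#include <time.h>

/* Hypothetical baseline: does no work, so timing it gives a rough
   estimate of the loop and call overhead to subtract from real runs. */
static double nop_d(double x) { return x; }

/* Time `iters` dependent calls of fn and return the elapsed CPU time
   in milliseconds. Each result is fed into the next call, so the loop
   is limited by latency, not throughput. */
static double time_unary_ms(double (*fn)(double), long iters)
{
    volatile double sink;   /* keeps the final result observable */
    double x = 2.0;
    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        x = fn(x) + 1.0;    /* +1.0 keeps the input away from 0 and 1 */
    }
    clock_t end = clock();
    sink = x;
    (void)sink;
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}

/* Ratio of two measured times, e.g. speedup(326.0, 82.0) for the
   sqrt and sqrtf totals quoted above. */
static double speedup(double slow_ms, double fast_ms)
{
    return slow_ms / fast_ms;
}
```

With this sketch, the double-precision case would be timed as time_unary_ms(sqrt, n) after including math.h, and nop_d can be timed the same way to estimate overhead. sqrtf takes and returns float, so it would need a parallel harness with a float (*)(float) parameter.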
I can't find a proper reason for this difference, and I haven't been able to find a good explanation online of how this instruction actually works.
Any information or pointers on this would be really useful.