Arm v7 instruction set provides a fast instruction for inverse reciprocal square root calculation vrsqrte_f32
for two simultaneous approximations and vrsqrteq_f32
for four approximations. (The scalar variant vrsqrtes_f32
is only available on Arm64 v8.2).
Then the result can be simply calculated by x * vrsqrte_f32(x);
, which has better than 0.33% relative accuracy over the whole range of positive values x. See https://www.mdpi.com/2079-3197/9/2/21/pdf
ARM NEON instruction FRSQRTE gives 8.25 correct bits of the result.
At x==0
vrsqrtes_f32(x) == Inf, so x*vrsqrtes_f32(x) would be NaN.
If the value of x==0 is unavoidable, the optimal two instruction sequence needs a bit more adjustment:
float sqrtest(float a) {
// need to "transfer" or "convert" the scalar input
// to a vector of two
// - optimally we would not need an instruction for that
// but we would just let the processor calculate the instruction
// for all the lanes in the register
float32x2_t a2 = vdup_n_f32(a);
// next we create a mask that is all ones for the legal
// domain of 1/sqrt(x)
auto is_legal = vreinterpret_f32_u32(vcgt_f32(a2, vdup_n_f32(0.0f)));
// calculate two reciprocal estimates in parallel
float32x2_t a2est = vrsqrte_f32(a2);
// we need to mask the result, so that effectively
// all non-legal values of a2est are zeroed
a2est = vand_u32(is_legal, a2est);
// x * 1/sqrt(x) == sqrt(x)
a2 = vmul_f32(a2, a2est);
// finally we get only the zero lane of the result
// discarding the other half
return vget_lane_f32(a2, 0);
}
Surely this method will have almost twice the throughput with
void sqrtest2(float &a, float &b) {
float32x2_t a2 = vset_lane_f32(b, vdup_n_f32(a), 1);
float32x2_t is_legal = vreinterpret_f32_u32(vcgt_f32(a2, vdup_n_f32(0.0f)));
float32x2_t a2est = vrsqrte_f32(a2);
a2est = vand_u32(is_legal, a2est);
a2 = vmul_f32(a2, a2est);
a = vget_lane_f32(a2,0);
b = vget_lane_f32(a2,1);
}
And even better, if you can work directly with float32x2_t
or float32x4_t
inputs and outputs.
float32x2_t sqrtest2(float32x2_t a2) {
float32x2_t is_legal = vreinterpret_f32_u32(vcgt_f32(a2, vdup_n_f32(0.0f)));
float32x2_t a2est = vrsqrte_f32(a2);
a2est = vand_u32(is_legal, a2est);
return vmul_f32(a2, a2est);
}
This implementation gives sqrtest2(1) == 0.998
and sqrtest2(400) == 19.97
(tested on MacBook M1 with arm64). Being branchless and LUT free, this has likely a constant execution time, assuming that all the instructions execute in constant number of cycles.