I want to calculate the magnitude and the angle of 4 points at once using NEON SIMD instructions in ARM assembly. Most languages, C++ in my case, have a built-in library function for the angle (atan2), but it handles only one pair of floating-point values (x and y) per call. I would like to use SIMD instructions that operate on q registers to calculate atan2 for a vector of 4 values.
High accuracy is not required; speed is more important.
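I already have a few assembly instructions that calculate the magnitude for 4 floating-point values at once, with acceptable accuracy for my application. q1 contains the 4 "x" values (x1, x2, x3, x4). q2 contains the 4 "y" values (y1, y2, y3, y4). After the sequence below, q7 contains the 4 approximate magnitudes (sqrt(x1^2 + y1^2), sqrt(x2^2 + y2^2), sqrt(x3^2 + y3^2), sqrt(x4^2 + y4^2)): the reciprocal estimate followed by the reciprocal square root estimate gives 1/sqrt(1/s) = sqrt(s), where s is the sum of squares.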
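vmul.f32 q7, q1, q1      @ q7 = x*x
vmla.f32 q7, q2, q2      @ q7 = x*x + y*y
vrecpe.f32 q7, q7        @ q7 ~= 1 / (x*x + y*y)            (reciprocal estimate)
vrsqrte.f32 q7, q7       @ q7 ~= 1 / sqrt(q7) = sqrt(x*x + y*y)  (reciprocal square root estimate)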
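For reference, this is roughly the scalar computation I would like to vectorise, written as a plain C++ sketch. The names `Vec4`, `PolarResult` and `polar_reference` are made up for illustration and are not part of any library; the function uses the exact `std::sqrt` and `std::atan2`, so it is only a baseline to compare an approximate SIMD version against, not a fast implementation.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical helper type: one float per lane of a q register.
using Vec4 = std::array<float, 4>;

// Hypothetical result type: magnitude and angle for each of the 4 points.
struct PolarResult {
    Vec4 magnitude;  // sqrt(x^2 + y^2) per lane
    Vec4 angle;      // atan2(y, x) per lane, in radians, range [-pi, pi]
};

// Scalar baseline: computes exactly (not approximately) what the
// NEON version should estimate for each of the 4 lanes.
inline PolarResult polar_reference(const Vec4& x, const Vec4& y) {
    PolarResult out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out.magnitude[i] = std::sqrt(x[i] * x[i] + y[i] * y[i]);
        out.angle[i]     = std::atan2(y[i], x[i]);
    }
    return out;
}
```

An approximate version can be judged by the largest per-lane difference from this baseline over the input range that matters for the application.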
What is the fastest way to calculate an approximate atan2 for two vectors using SIMD instructions?