I have the following loop that takes the square root of each entry in an array:
#include <xmmintrin.h> // SSE intrinsics (_mm_load_ps, _mm_sqrt_ps, _mm_store_ps); <mmintrin.h> is MMX only
#include <cmath>       // floor

float array[SIZE]; // must be 16-byte aligned for _mm_load_ps

for (int i = 0; i < SIZE; i += 4)
{
    __m128 fourFloats, fourRoots;
    fourFloats = _mm_load_ps(&array[i]);
    fourRoots = _mm_sqrt_ps(fourFloats);
    float results[4];
    _mm_store_ps(results, fourRoots);
    // This is the bottleneck: clip each root to 63 and round it down
    array[i] = results[0] > 63.0F ? 63.0F : floor(results[0]);
    array[i+1] = results[1] > 63.0F ? 63.0F : floor(results[1]);
    array[i+2] = results[2] > 63.0F ? 63.0F : floor(results[2]);
    array[i+3] = results[3] > 63.0F ? 63.0F : floor(results[3]);
    // This is slower
    // array[i] = (int) std::min(floor(results[0]), 63.0F);
}
According to my profiler (Zoom), the square roots take no significant amount of time, but each of the four clipping statements takes about 20% of the time, even with -O2 optimisation on. Is there a more efficient way to implement the loop? Note that the _mm_store_ps() call gets optimised out by gcc.
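For comparison, the clip can in principle stay in SSE registers instead of going through results[]. The following is an unmeasured sketch, not a confirmed fix: it assumes SSE2 is available, that every input is non-negative (so truncation equals floor), and that the element count is a multiple of 4. The function name sqrtClipFloor is hypothetical.

```cpp
#include <emmintrin.h> // SSE2: _mm_cvttps_epi32, _mm_cvtepi32_ps (also pulls in the SSE intrinsics)
#include <cstddef>

// Hypothetical helper: replaces each value with min(floor(sqrt(value)), 63).
// Assumes non-negative inputs and a count that is a multiple of 4.
void sqrtClipFloor(float* data, std::size_t count)
{
    const __m128 limit = _mm_set1_ps(63.0F);
    for (std::size_t i = 0; i < count; i += 4)
    {
        __m128 values = _mm_loadu_ps(data + i);              // unaligned load, so no alignment requirement
        __m128 roots = _mm_min_ps(_mm_sqrt_ps(values), limit); // clip to 63 without branching
        __m128i truncated = _mm_cvttps_epi32(roots);          // truncate toward zero (floor for values >= 0)
        _mm_storeu_ps(data + i, _mm_cvtepi32_ps(truncated));  // back to float, stored in place
    }
}
```

Unlike the scalar loop, a negative input (NaN root) would come out as a clipped or converted value rather than NaN, which is why the non-negative assumption matters.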
I tried an optimised table lookup of the square root, as 97% of the input array values are under 512, but that didn't help. Note that this routine takes just under a quarter of the total processor time for my complete application, a constantly-running image recognition application.
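A table lookup of this kind could look roughly like the sketch below. It is an illustration of the idea, not the exact code that was tried: the names kTableSize, gRootTable, initRootTable and clippedRootsByTable are hypothetical, and it relies on floor(sqrt(x)) being equal to floor(sqrt(floor(x))) for non-negative x, so the table can be indexed by the integer part of the input.

```cpp
#include <cmath>
#include <cstddef>

const int kTableSize = 512;          // hypothetical: covers the common inputs below 512
static float gRootTable[kTableSize]; // gRootTable[n] == floor(sqrt(n))

// Fill the table once at start-up.
void initRootTable()
{
    for (int n = 0; n < kTableSize; ++n)
        gRootTable[n] = std::floor(std::sqrt(static_cast<float>(n)));
}

// Replace each value with min(floor(sqrt(value)), 63), using the table where possible.
void clippedRootsByTable(float* values, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        const float v = values[i];
        if (v >= 0.0F && v < static_cast<float>(kTableSize))
        {
            // Common case: sqrt(511) is below 63, so no clip is needed here.
            values[i] = gRootTable[static_cast<int>(v)];
        }
        else
        {
            // Rare case: fall back to the same computation as the original loop.
            const float root = std::sqrt(v);
            values[i] = root > 63.0F ? 63.0F : std::floor(root);
        }
    }
}
```

The per-element branch and the float-to-int conversion for the index are costs of their own, which may be part of why such a lookup does not beat _mm_sqrt_ps.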