
I have the following loop that takes the square root of each entry in an array:

#include <xmmintrin.h>

float array[SIZE];
for (int i = 0; i < SIZE; i += 4)
{
    __m128 fourFloats, fourRoots;
    fourFloats = _mm_load_ps(&array[i]);
    fourRoots = _mm_sqrt_ps(fourFloats);
    float results[4];
    _mm_store_ps(results, fourRoots);
    // This is the bottleneck
    array[i]   = results[0] > 63.0F ? 63.0F : floor(results[0]);
    array[i+1] = results[1] > 63.0F ? 63.0F : floor(results[1]);
    array[i+2] = results[2] > 63.0F ? 63.0F : floor(results[2]);
    array[i+3] = results[3] > 63.0F ? 63.0F : floor(results[3]);
    // This is slower
//  array[i] = (int) std::min(floor(results[0]), 63.0F);
}

According to my profiler (Zoom), the square roots take no significant amount of time, but each of the four clipping lines takes about 20% of the total time, even with -O2 optimisation on. Is there a more efficient way to implement the loop? Note that `_mm_store_ps()` gets optimised out by gcc.

I tried an optimised table lookup of the square root as 97% of the input array values are under 512, but that didn't help. Note that this routine takes just under a quarter of the total processor time for my complete application, a constantly-running image recognition application.

Ken Y-N
  • Have you tried to use `_mm_floor_ps` on the `fourRoots` before storing it into `results` and later make this conditional stuff you have? – ixSci Feb 10 '16 at 06:11
  • BTW, there are SSE commands to process 8 floats at a time – ixSci Feb 10 '16 at 06:12
  • @ixSci I didn't realise there was such a function! I've now looked it up, and SSE4 is from 2007 or so, and since our target will be i7 or so devices, I have no qualms about using this instruction. – Ken Y-N Feb 10 '16 at 06:22
  • corrected my answer below – zx485 Feb 10 '16 at 06:24
  • You should have used `gcc -O3` to enable autovectorization. I'm not surprised converting to `int` and back is slow, esp. if you're doing it scalar, instead of with `_mm_cvttps_epi32` (the extra `t` means truncation.) If you want the result to eventually be `float`, though, it is certainly better to use `floor_ps` and clamp to 63.0f with `min_ps`. This needs SSE4.1, so you need 2nd-gen core2. (Or AVX for 256b vectors, but that requires Intel Sandybridge / AMD Bulldozer. First-gen i7 (Nehalem/Westmere) is old but not gone, and not even totally obsolete.) – Peter Cordes Feb 10 '16 at 09:50
  • Also note that Intel leaves out AVX support in their lowest-end desktop chips (Celeron). This is really annoying, and delays the point where AVX / AVX2 will be baseline. Runtime CPU detection isn't too hard, so you can do that and set a function pointer to the best version of your function that works on the current host. – Peter Cordes Feb 10 '16 at 09:53
  • possible duplicate: [What is the instruction that gives branchless FP min and max on x86?](https://stackoverflow.com/q/40196817) – Peter Cordes Dec 02 '22 at 11:23

3 Answers


There are MAXPS and MINPS:

__m128 _mm_max_ps(__m128 a, __m128 b);

Performs an SIMD compare of the packed single-precision floating-point values in the first source operand and the second source operand and returns the maximum value for each pair of values to the destination operand.

and

__m128 _mm_min_ps(__m128 a, __m128 b);

Performs an SIMD compare of the packed single-precision floating-point values in the first source operand and the second source operand and returns the minimum value for each pair of values to the destination operand.

Use an XMM register with four 63.0f values as the second operand.

zx485
  • `_mm_min_ps()` is what I'm looking for, so combining the two replies I've knocked 25% off the total runtime, and this function's contribution has dropped from about 24% to 7%. Thank you. – Ken Y-N Feb 10 '16 at 06:29
  • @KenY-N, how about using `__m256` registers (and corresponding commands) also? I think it will improve your experience even more. – ixSci Feb 10 '16 at 06:31
  • @ixSci That's knocked another few percent off, and after adding `-mavx` to my compile line, since `SIZE` is 40, it also unrolled the inner loop. I suppose `__m512` would be yet better. – Ken Y-N Feb 10 '16 at 07:01

Given you can use a pretty modern CPU I'd start with this:

#include <immintrin.h>  // AVX intrinsics

alignas(32) float array[SIZE];  // _mm256_load_ps requires 32-byte alignment
for(int i = 0; i < SIZE; i += 8)
{
    __m256 eightFloats, eightRoots;
    eightFloats = _mm256_load_ps(&array[i]);
    eightRoots = _mm256_sqrt_ps(eightFloats);
    float results[8];
    eightRoots = _mm256_floor_ps(eightRoots);
    _mm256_store_ps(results, eightRoots);
    ...
}

Or even go for the 512-bit versions, if the most advanced SIMD instructions are allowed.

ixSci
  • There is no AVX512 hardware, other than Intel's compute cards (Xeon Phi). Skylake-E cores aren't out yet. Any current "skylake xeon" chips are still the desktop core without AVX512. – Peter Cordes Feb 10 '16 at 09:56

To summarise the two answers, this is the code I eventually decided on for my full requirement: `array[i] = std::min(floor(sqrt(array[i])), (float) 0x3f);`

#include <immintrin.h>  // AVX intrinsics

alignas(32) float array[SIZE];  // _mm256_load_ps requires 32-byte alignment
const float clipValue = (float) 0x3f;
alignas(32) const float clipArray[8] = {clipValue, clipValue, clipValue, clipValue,
                                        clipValue, clipValue, clipValue, clipValue};
__m256 eightClips = _mm256_load_ps(clipArray);
for(int i = 0; i < SIZE; i += 8)
{
    __m256 eightFloats = _mm256_load_ps(&array[i]);
    __m256 eightRoots  = _mm256_sqrt_ps(eightFloats);
    __m256 eightFloors = _mm256_floor_ps(eightRoots);
    __m256 eightMins   = _mm256_min_ps(eightFloors, eightClips);
    _mm256_store_ps(&array[i], eightMins);
}

I'm targeting specific hardware in a vertical app, so an AVX-compatible processor is a given.

Ken Y-N
  • Just use `_mm256_set1_ps(63.0f)`. If `clipArray` is a local, your compiler might actually emit code that stores clipValue to the stack 8 times! See [my answer on vector constants](http://stackoverflow.com/a/35297579/224132) for example. You might want to try to get your compiler to emit a `vbroadcastss` load from a single static const `float`, if it doesn't turn the `set1` into that already. Clang should, gcc probably won't. `63.0f` has a bit pattern that would take more than a couple instructions to [gen on the fly with shifts](http://stackoverflow.com/q/35085059/224132) :/ – Peter Cordes Feb 12 '16 at 02:29
  • @Peter Cordes: An alternative for `_mm256_set1_ps(63.0f)` is `clipValuei32=_mm256_undefined_si256(); clipValuei32=_mm256_cmpeq_epi32(clipValuei32,clipValuei32); clipValuei32=_mm256_srli_epi32(clipValuei32,26); clipValue=_mm256_cvtepi32_ps(clipValuei32);`, which might be faster if cache misses are a problem. It compiles to three instructions, each of them with only 0.5 cycle throughput on Skylake. – wim Feb 13 '16 at 11:36
  • @wim: oh good point, integer 63 is easy to generate. Your best bet is usually to start with `set1_epi32(-1)`, since some compilers suck at handling `mm_undefined_*()`. Except on MSVC, where you get a 256b load of the all-ones constant, instead of a `vpcmpeq`. Further testing is required to see if MSVC always inserts a pxor dependency-breaker for mm_undefined (not needed because vpcmpeq same,same is already a dep breaker). – Peter Cordes Feb 13 '16 at 14:42