
For a research project, I need to compute a lot of Euclidean distances, where certain dimensions must be selected and others discarded. In the current state of the program, the array of selected dimensions has ~100 elements, and I compute around 2-3 million distances. My current piece of code is as follows:

float compute_distance(const float* p1, const float* p2) const
{
    __m256 euclidean = _mm256_setzero_ps();

    const uint16_t n = nbr_dimensions;
    const uint16_t aligned_n = n - n % 16;
    const float* local_selected = selected_dimensions;

    for (uint16_t i = 0; i < aligned_n; i += 16)
    {
        const __m256 r1 = _mm256_sub_ps(_mm256_load_ps(&p1[i]), _mm256_load_ps(&p2[i]));
        euclidean = _mm256_fmadd_ps(_mm256_mul_ps(r1, r1), _mm256_load_ps(&local_selected[i]), euclidean);
        const __m256 r2 = _mm256_sub_ps(_mm256_load_ps(&p1[i + 8]), _mm256_load_ps(&p2[i + 8]));
        euclidean = _mm256_fmadd_ps(_mm256_mul_ps(r2, r2), _mm256_load_ps(&local_selected[i + 8]), euclidean);
    }
    float distance = hsum256_ps_avx(euclidean);

    for (uint16_t i = aligned_n; i < n; ++i)
    {
        const float num = p1[i] - p2[i];
        distance += num * num * local_selected[i];
    }

    return distance;
}

The selected dimensions are pre-determined. I could thus pre-compute an array of __m256 masks to pass to _mm256_blendv_ps instead of multiplying by 0 or 1 in the line euclidean = _mm256_fmadd_ps(_mm256_mul_ps(r1, r1), _mm256_load_ps(&local_selected[i]), euclidean);. But I'm rather a novice with intrinsics, and I have not yet found a working solution.

I was wondering if you could offer some advice, or even code suggestions, to improve the running speed of this function. As a side note, I do not have access to AVX-512 instructions.


Update: using the first suggestion below (precomputed masks with _mm256_and_ps), the code becomes:

float compute_distance(const float* p1, const float* p2) const
{
    const size_t n = nbr_dimensions;
    const size_t aligned_n = n - n % 16;
    const unsigned int* local_selected = selected_dimensions;
    const __m256* local_masks = masks;

    __m256 euc1 = _mm256_setzero_ps(), euc2 = _mm256_setzero_ps(),
        euc3 = _mm256_setzero_ps(), euc4 = _mm256_setzero_ps();

    const size_t n_max = aligned_n/8;
    for (size_t i = 0; i < n_max; i += 4)
    {       
        const __m256 r1 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 0]), _mm256_load_ps(&p2[i * 8 + 0]));
        const __m256 r1_1 = _mm256_and_ps(r1, local_masks[i + 0]);
        euc1 = _mm256_fmadd_ps(r1_1, r1_1, euc1);

        const __m256 r2 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 8]), _mm256_load_ps(&p2[i * 8 + 8]));
        const __m256 r2_1 = _mm256_and_ps(r2, local_masks[i + 1]);
        euc2 = _mm256_fmadd_ps(r2_1, r2_1, euc2);

        const __m256 r3 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 16]), _mm256_load_ps(&p2[i * 8 + 16]));
        const __m256 r3_1 = _mm256_and_ps(r3, local_masks[i + 2]);
        euc3 = _mm256_fmadd_ps(r3_1, r3_1, euc3);

        const __m256 r4 = _mm256_sub_ps(_mm256_load_ps(&p1[i * 8 + 24]), _mm256_load_ps(&p2[i * 8 + 24]));
        const __m256 r4_1 = _mm256_and_ps(r4, local_masks[i + 3]);
        euc4 = _mm256_fmadd_ps(r4_1, r4_1, euc4);
    }

    float distance = hsum256_ps_avx(_mm256_add_ps(_mm256_add_ps(euc1, euc2), _mm256_add_ps(euc3, euc4)));

    for (size_t i = aligned_n; i < n; ++i)
    {
        const float num = p1[i] - p2[i];
        distance += num * num * local_selected[i];      
    }

    return distance;
}
Deadloop

1 Answer


Basic advice:

Don't use uint16_t for your loop counter unless you really want to force the compiler to truncate to 16 bits every time. Use at least unsigned, or you sometimes get better asm from using uintptr_t (or more conventionally, size_t). Zero-extension from 32-bit to pointer width happens for free on x86-64 just from using 32-bit operand-size asm instructions, but sometimes compilers still don't do great.

Use five or more separate accumulators instead of one euclidean, so multiple sub / FMA instructions can be in flight without bottlenecking on the latency of a loop-carried dependency chain that does FMAs into one accumulator.

FMA has a latency of 5 cycles but a throughput of one per 0.5 cycles on Intel Haswell. See also latency vs throughput in intel intrinsics, and also my answer on Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for a more-advanced version.


Avoid passing args through global variables. Apparently your n is a compile-time constant (which is good), but selected_dimensions isn't, is it? If it is also a constant, then you're only using one set of masks in your whole program, so never mind the stuff below about compressing masks.

Using global variables can defeat compiler optimization when it inlines your function into a caller which sets a global before calling it. (Usually only if there's a non-inline function call between setting the global and using it, but that's not uncommon.)


update: your arrays are small, only ~100 elements, so unrolling by only 2 is probably a good idea, to reduce startup / cleanup overhead. Out-of-order execution can hide the FMA latency over this short number of iterations, especially if the final result of this function call isn't needed to decide the input parameters to the next call.

Total function-call overhead is important, not just how efficient the vectorization is for large arrays.

As discussed in comments, peeling the first iteration of the loop lets you avoid the first FMA by initializing euc1 = stuff(p1[0], p2[0]); instead of _mm256_setzero_ps().
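For example, a minimal sketch of that peeling, assuming n >= 8 and the unpacked masks[] array from your updated code:

__m256 d0 = _mm256_sub_ps(_mm256_load_ps(&p1[0]), _mm256_load_ps(&p2[0]));
d0 = _mm256_and_ps(d0, masks[0]);
__m256 euc1 = _mm256_mul_ps(d0, d0);   // first accumulator starts from real data: no FMA with a zero accumulator
__m256 euc2 = _mm256_setzero_ps();     // the other accumulators still start at zero
// ... the main loop then starts at the second vector (element 8)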

Padding your arrays out to a full vector (or even a full unrolled loop-body of 2 vectors) with zeros lets you avoid the scalar cleanup loop entirely, and make the whole function very compact.

If you couldn't just pad, you could still avoid a scalar cleanup by loading an unaligned vector that goes right up to the end of the inputs, and mask that to avoid double-counting. (See this Q&A for a way to generate a mask based on a misalignment count). In other kinds of problems where you're writing an output array, it's fine to redo overlapping elements.

You don't show your hsum256_ps_avx code, but that's a decent fraction of the total latency and maybe throughput of your function. Make sure you optimize it for throughput: e.g. avoid haddps / _mm_hadd_ps. See my answer on Fastest way to do horizontal float vector sum on x86.
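For reference, a sketch of what a throughput-friendly hsum256_ps_avx could look like (an assumption about yours; the key points are narrowing to 128 bits first and avoiding haddps):

static inline float hsum256_ps_avx(__m256 v)
{
    __m128 sum = _mm_add_ps(_mm256_castps256_ps128(v),     // low half
                            _mm256_extractf128_ps(v, 1));  // high half
    __m128 shuf = _mm_movehdup_ps(sum);                    // duplicate odd elements
    sum = _mm_add_ps(sum, shuf);
    shuf = _mm_movehl_ps(shuf, sum);                       // move high pair down to low pair
    sum = _mm_add_ss(sum, shuf);
    return _mm_cvtss_f32(sum);
}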


Your specific case:

I could thus pre-compute an array of __m256 to pass to a _mm256_blendv_ps instead of multiplying by 0 or 1 in the FMA.

Yes, that would be better, especially if it lets you fold something else into an FMAdd / FMSub. But even better than that, use a boolean _mm256_and_ps with all-zeros or all-ones. That leaves a value unchanged (all-ones & x == x) or zeroed (all-zeros & x == 0, and the bit pattern of float 0.0 is all-zeros).

If your masks aren't missing in cache, then store them fully unpacked so they can just be loaded.
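For example, a one-time setup sketch that turns your 0/1 selected_dimensions array into fully-unpacked AND-masks (build_masks is a made-up name; an all-ones element prints as a NaN when viewed as float, which is expected):

#include <immintrin.h>
#include <cstddef>
#include <cstring>

void build_masks(const unsigned int* selected_dimensions, size_t n, __m256* masks)
{
    for (size_t i = 0; i < n; i += 8) {
        alignas(32) float m[8];
        for (size_t j = 0; j < 8; ++j) {
            // all-ones bit pattern keeps the element, all-zeros discards it
            unsigned int bits = (i + j < n && selected_dimensions[i + j]) ? 0xFFFFFFFFu : 0u;
            std::memcpy(&m[j], &bits, sizeof(float));
        }
        masks[i / 8] = _mm256_load_ps(m);
    }
}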

If you're using different masks with the same p1 and p2, you could pre-compute p1-p2 squared, and then just do a masked add_ps reduction over that. (But note that FMA has better throughput than ADD on Intel pre-Skylake. Haswell/Broadwell have 2 FMA units, but run ADDPS on a dedicated unit with lower latency (3c vs. 5c). There's only one vector-FP add unit. Skylake just runs everything on the FMA units with 4 cycle latency.) Anyway, this means it can actually be a throughput win to use FMA as 1.0 * x + y. But you're probably fine, because you still need to load the mask and the square(p1-p2) separately, so that's 2 loads per FP add, so one per cycle keeps up with load throughput. Unless you (or the compiler) peel a few iterations at the front and keep the float data for those iterations in registers across multiple different local_selected masks.
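A rough sketch of that reuse idea, with a single accumulator for brevity (sq[] holding the precomputed (p1[i]-p2[i])^2 values and masks[] holding the unpacked AND-masks are assumed names):

__m256 acc = _mm256_setzero_ps();
const __m256 one = _mm256_set1_ps(1.0f);
for (size_t i = 0; i < n / 8; ++i) {
    __m256 s = _mm256_and_ps(_mm256_load_ps(&sq[i * 8]), masks[i]);  // keep or zero each square
    acc = _mm256_fmadd_ps(s, one, acc);   // 1.0*x + y: an add that runs on the FMA units
}
float distance = hsum256_ps_avx(acc);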


update: I wrote this assuming that the array size was 2-3 million, rather than ~100. Profile for L1D cache misses to decide if spending more CPU instructions to reduce cache footprint is worth it. If you always use the same mask for all 3 million calls, it's probably not worth it to compress.

You could compact your masks down to 8 bits per element and load them with pmovsx (_mm256_cvtepi8_epi32) (sign-extending an all-ones value produces a wider all-ones, because that's how 2's complement -1 works). Unfortunately using it as a load is annoying; compilers sometimes fail to optimize _mm256_cvtepi8_epi32(_mm_cvtsi64x_si128(foo)) into vpmovsxbd ymm0, [mem], and instead actually use a separate vmovq instruction.

const uint64_t *local_selected = something;  // packed to 1B per element

__m256 euc1 = _mm256_setzero_ps(), euc2 = _mm256_setzero_ps(),
       euc3 = _mm256_setzero_ps(), euc4 = _mm256_setzero_ps();

for (size_t i = 0; i < n/8; i += 4) {  // i counts vectors of 8 floats; 4 vectors (32 floats) per iteration

    __m256 mask = _mm256_castsi256_ps( _mm256_cvtepi8_epi32(_mm_cvtsi64x_si128(local_selected[i + 0])) );
    // __m256 mask = _mm256_load_ps(&local_selected[i*8 + 0]); //  without packing (float masks of all-ones / all-zeros)

    __m256 r1 = _mm256_sub_ps(_mm256_load_ps(&p1[i*8 + 0]), _mm256_load_ps(&p2[i*8 + 0]));
    r1 = _mm256_and_ps(r1, mask);             // zero r1 or leave it untouched.
    euc1 = _mm256_fmadd_ps(r1, r1, euc1);     // euc1 += r1*r1
    // ... same for r2 with local_selected[i + 1]
    // and p1/p2[i*8 + 8]
    // euc2 += (r2*r2) & mask2

    // and again for euc3 (local_selected[i + 2], p1/p2[i*8 + 16])
    // and again for euc4 (local_selected[i + 3], p1/p2[i*8 + 24])
}
float euclidean = hsum256_ps_avx(_mm256_add_ps(_mm256_add_ps(euc1, euc2), _mm256_add_ps(euc3, euc4)));

I guess you bottleneck slightly on load throughput without the pmovsx, since you have three loads for three vector ALU operations. (And with micro-fusion, it's only 4 total fused-domain uops on an Intel CPU, so it's not bottlenecked on the front-end). And the three ALU uops can run on different ports (vandps is 1 uop for port 5 on Intel pre-Skylake. On SKL it can run on any port).

Adding in a shuffle (the pmovsx) potentially bottlenecks on port 5 (on Haswell/Broadwell). You might want to use vpand for the masking so it can run on any port, if you're tuning for HSW/BDW, even though that adds bypass latency between an integer AND and FP math instructions. With enough accumulators, you're not latency-bound. (Skylake has extra bypass latency for VANDPS depending on which port it happens to run on.)

blendv is slower than AND: always at least 2 uops.


Compressing the mask even more for large arrays

If your arrays are larger than L2 cache, and your mask array has as many elements as your float arrays, you will most likely bottleneck on load bandwidth (at least once you unroll with multiple vector accumulators). This means that spending more instructions unpacking the mask data is worth it to reduce that part of the bandwidth requirement.

I think the ideal format for your mask data is with 32 vectors of masks interleaved, which makes it very cheap to "unpack" on the fly. Use a shift to bring the right mask into the high bit of each 32-bit element, and use it with vblendvps to conditionally zero elements by blending with zero. (Or with an arithmetic right shift + boolean AND)

__m256i masks = _mm256_load_si256(...);

// vblendvps looks only at the sign bit of each 32-bit element of the control vector
r0 = _mm256_blendv_ps(_mm256_setzero_ps(), r0, _mm256_castsi256_ps(masks));
...

__m256i mask1 = _mm256_slli_epi32(masks, 1);
r1 = _mm256_blendv_ps(_mm256_setzero_ps(), r1, _mm256_castsi256_ps(mask1));
...

__m256i mask2 = _mm256_slli_epi32(masks, 2);
r2 = _mm256_blendv_ps(_mm256_setzero_ps(), r2, _mm256_castsi256_ps(mask2));
...

// fully unrolling is overkill; you can set up for a loop back to r0 with
masks = _mm256_slli_epi32(masks, 4);

You could also do masks = _mm256_slli_epi32(masks, 1); at every step, which might be better because it uses 1 register less. But it might be more sensitive to resource conflicts causing latency on the mask dep chain, since every mask depends on the previous one.

Intel Haswell runs both vblendvps uops on port5 only, so you could consider using _mm256_srai_epi32 + _mm256_and_ps. But Skylake can run the 2 uops on any of p015, so blend is good there (although it does tie up a vector register holding the all-zero vector).


Generate masks in that interleaved format with a packed compare, then _mm256_srli_epi32(cmp_result, 31) to leave a 0 or 1 in the low bit of each element. Shift the vector you're building up left by 1, then OR that bit in; repeat for all 32 vectors, so the first vector's control ends up in the high bit.
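An untested generation sketch, assuming a 0/1 selected[] array (made-up name) covering 32 vectors of 8 floats:

__m256i packed = _mm256_setzero_si256();
for (int k = 0; k < 32; ++k) {
    __m256i sel = _mm256_loadu_si256((const __m256i*)&selected[k * 8]);
    __m256i cmp = _mm256_cmpgt_epi32(sel, _mm256_setzero_si256());  // all-ones where selected
    __m256i bit = _mm256_srli_epi32(cmp, 31);                       // 0 or 1 in the low bit of each element
    packed = _mm256_or_si256(_mm256_slli_epi32(packed, 1), bit);    // shift, then OR the new bit in
}
// after the loop, vector 0's control is in bit 31 of each element, vector 1's in bit 30, etc.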


You can still use this format if you have fewer than 32 whole vectors of data in your arrays. The lower bits will just go unused. Or you could have the masks for 2 or more selected_dimensions per vector. e.g. the top 16 bits of each element are for one selected_dimensions, and the bottom 16 bits are for another. You could do something like

__m256i masks = _mm256_load_si256(&dimensions[selector/2]);
masks = _mm256_sll_epi32(masks, _mm_cvtsi32_si128(16 * (selector % 2)));

// or maybe
if (selector % 2) {
    masks = _mm256_slli_epi32(masks, 16);
}

AVX512:

AVX512 can use a bitmap mask directly, so it's somewhat more efficient. Just use const __mmask16 *local_selected = whatever; to declare an array of 16-bit masks (for use with 512b vectors of 16 floats), and use r0 = _mm512_maskz_sub_ps(local_selected[i], p1, p2); to zero-mask the subtraction (the mask is the first argument).
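For completeness, a minimal AVX-512 loop sketch (not usable on your CPU; assumes n is a multiple of 16 and local_selected is that __mmask16 array):

__m512 acc = _mm512_setzero_ps();
for (size_t i = 0; i < n / 16; ++i) {
    __m512 d = _mm512_maskz_sub_ps(local_selected[i],                // zero the unselected lanes
                                   _mm512_loadu_ps(&p1[i * 16]),
                                   _mm512_loadu_ps(&p2[i * 16]));
    acc = _mm512_fmadd_ps(d, d, acc);
}
float distance = _mm512_reduce_add_ps(acc);   // convenience intrinsic for the horizontal sum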

If you actually bottleneck on load-port uop throughput (2 loads per clock), you might try loading 64 bits of mask data at once, and using a mask-shift to get a different low-16 of it. This probably won't be a problem unless your data is hot in L1D cache, though.

It's very easy to generate the mask data in the first place with a compare-into-mask, with no interleaving required.


Ideally you could cache-block the code that calls this so you could reuse data while it was hot in cache. e.g. get all the combinations you want from the first 64kiB of p1 and p2, then move on to later elements and do them while they're hot in cache.

Peter Cordes
  • Thank you a lot! I have one question about your suggestion: shouldn't I use _mm256_loadu_si256 instead of _mm_cvtsi64x_si128, since local_selected is as long as the points p1 and p2? – Deadloop Aug 17 '17 at 15:47
  • @Deadloop: my example is showing the `pmovsx` variation, where it's packed into byte elements to reduce cache misses. – Peter Cordes Aug 17 '17 at 20:28
  • @Deadloop: compressing the mask even more is better: 1 bit per element is another factor of 8, and only slightly more expensive to use. – Peter Cordes Aug 18 '17 at 14:31
  • Thank you for your multiple edits, it becomes clearer to me. I will test your suggestions, and validate your answer as soon as possible. – Deadloop Aug 18 '17 at 14:45
  • Regarding your first solution, I have converted my vector of unsigned int (the selected dimensions) into a dynamic array of __m256 masks. More precisely, there are 8 dimensions (i.e. 8 unsigned int) per uint64_t, and each uint64_t is first passed to _mm256_cvtepi8_epi32 and then cast using _mm256_castsi256_ps. However, I don't get why each mask contains 1.401e-45#DEN instead of 1 when I use _mm256_castsi256_ps; is this intended? – Deadloop Aug 19 '17 at 11:57
  • Regarding your second solution, I must say that you really lost me. I'm OK with having a 32-bit unsigned integer representing each selected dimension. But loading each one with _mm256_load_si256 and casting with _mm256_castsi256_ps produces the same effect I described earlier. I think the "Use a shift to bring the right mask into the high bit of each 32-bit element" part is not clear to me. – Deadloop Aug 19 '17 at 11:57
  • @Deadloop: For the pack-to-bytes version, one `uint64_t` should be something like `0xff00ff00ff00ff00` to select only the even elements. Unpacking it should produce `0xffffffff`, `0x00000000`, ... The `float` value of all-ones is a NaN. If you got a denormal, then one of us made a mistake somewhere. – Peter Cordes Aug 19 '17 at 12:09
  • @Deadloop: the second version stores the blend-control for 32 `__m256` vectors in one `__m256i`. The blend-control for the first vector of 8 floats is in the high bit of each element, where you can use it directly. The blend-control for the 2nd vector of 8 floats is in the 2nd-highest bit of each element, so you can bring it to the top of each element by shifting each element left by 1 bit. This is more efficient to use than if you had the bitmap "in order" instead of interleaved. (AVX2 variable-shifts with a per-element count are 3 uops on Haswell, and need an extra vector constant.) – Peter Cordes Aug 19 '17 at 12:14
  • Thanks for the clarifications! I think I must double-check my unsigned int to uint64_t conversion ;) – Deadloop Aug 19 '17 at 12:18
  • @Deadloop: Remember that packing into `uint64_t` is really just packing each 32-bit `0` / `-1` into 8 bits, so we can sign-extend each byte back to a 32-bit `0` or `-1`. You could do it with 4x vector compares -> 2x `packssdw` -> `packsswb`, but beware of lane-crossing for AVX2 pack instructions. Use a shuffle (like `vpermq`) to fix the ordering of 64-bit groups of bytes since `vpmovsxbd` just goes in-order. – Peter Cordes Aug 19 '17 at 12:23
  • Unfortunately, the first solution does not appear to consistently outperform my current solution. – Deadloop Aug 19 '17 at 16:32
  • @Deadloop: Are you testing with big enough arrays that you see cache misses? Did you already unroll with multiple accumulators to hide FP-add latency? (Otherwise that will probably be the bottleneck if they're still coming from L3 cache.) Did you check the compiler asm output to see if it avoided extra instructions? (it's hard to get it to use `vpmovsx` as a load directly, instead of using `movq`) – Peter Cordes Aug 19 '17 at 16:36
  • I updated my initial post with the code you suggested in your first proposition; is this right? Concerning the testing, I compute 7 million distances in around 179 ms (VS 2017, i7-4770K, /Ox). – Deadloop Aug 19 '17 at 20:16
  • @Deadloop: You forgot to adjust the `%16` alignment stuff for the larger unroll factor, but otherwise ok. Oh right, you still need to handle the last few elements. That means you might need a separate copy of `local_selected` for scalar, or have scalar unpack the mask data the same way the vector loop does. You should probably have a non-unrolled vector cleanup between the main unrolled loop and the scalar cleanup, so the scalar loop only has to run 0-7 times, not 0-31 times. – Peter Cordes Aug 19 '17 at 20:30
  • Or do the last up-to-7 elements with an unaligned load from the end of the vector, and mask out the overlapping elements? Or much better: if possible, have your input arrays padded out to a full vector width, with zeros in the floats and/or masks, so you don't need a scalar loop at all. – Peter Cordes Aug 19 '17 at 20:31
  • @Deadloop: hmm, 179ms seems quite slow. Maybe I'm not understanding you correctly. Is this for one invocation of your function with array length = 7 million elements, producing the sum of that many point-to-point distances? Or is that for 7 million calls to the whole function with each one doing about 100 floats of work? If the latter, that might be a reasonable time. (The highly-compressed mask stuff is good for the large-array case. With small arrays, even fully-unpacked masks might stay hot in cache, but if not then packing is good.) – Peter Cordes Aug 19 '17 at 20:50
  • The 179 ms corresponds to 7 million calls to the whole function. Your idea of padding the arrays p1 and p2 to avoid the scalar loop was also on my to-do list. I will check that as soon as possible. – Deadloop Aug 19 '17 at 21:05
  • @Deadloop: So what's `n`? Is it a compile-time constant? If it's only ~100 or so, function-call overhead matters a lot. (And out-of-order execution can hide the FMA latency, so unrolling by only 2 might make sense, to reduce the overhead of adding accumulators together.) You might also want to peel the first iteration instead of starting with `setzero()`, so you avoid the first two FMAs. i.e. start with `euc1 = (what you have now for r1_1)`. You have to check that `n` >= 16, of course, if it's possible that it might not be. – Peter Cordes Aug 19 '17 at 21:13
  • @Deadloop: you should definitely measure L1D cache misses (using hardware performance counters) in your complete program (not just a microbenchmark) to help you decide whether to pack the mask or not. I've heard that VTune works well on Windows. If you can build your code on a Linux box, `perf stat -d ./my-program` is an easy way. – Peter Cordes Aug 19 '17 at 21:16
  • Yes, n is only ~100. I wasn't aware of this technique (loop splitting); thanks for the advice :). I have access to Intel VTune and Advisor. – Deadloop Aug 19 '17 at 21:24
  • @Deadloop: Terminology: Fully unrolling the first one or more iterations at the start or end is called "loop peeling". I'm not sure if "loop splitting" has a well-understood specific meaning, but it sounds more like what Intel's optimization manual and some other documents call "loop fission": e.g. split `for() { a[i] = i; b[i] = 2; }` into two separate loops. (Useful for over-complicated loops with too many output streams). – Peter Cordes Aug 19 '17 at 21:36
  • @Deadloop: my question about `n` was more specific than that. I asked if it was a compile-time constant, or if it's really a run-time variable that you pass as a global. (Don't do that, it can defeat optimization when inlining the function call into the code that set the global.) If you can pad easily, then it doesn't matter much whether it's exactly 96 vs. 95, but with your current version 95 will be significantly slower than 96, because the scalar loop runs many iterations instead of 0. – Peter Cordes Aug 19 '17 at 21:39
  • Sorry, yes, n is a compile-time constant – Deadloop Aug 19 '17 at 21:48