AVX implementation of Euclidean Distance and Compare to Threshold

Question

I'm pretty new to AVX (and C!) and I'm trying to calculate the squared Euclidean distance between two vectors and return a vector holding 1.0 where the distance is less than or equal to some threshold and 0.0 where it is greater.

For instance, if the squared distances are [5.0, 6.0, 2.0, 1.0] and the threshold is 4.0, I would like the function to return [0.0, 0.0, 1.0, 1.0]. The code below is what I have so far (adapted a little from "AVX2 float compare and get 0.0 or 1.0 instead of all-0 or all-one bits"). It works but leaves a lot to be desired.

__m256d diff, value, ta, tb, tc, tta, ttb, ttc, sum, comp, mask;
comp = _mm256_set1_pd(4.0); //this is what I want to compare the distance to

ta = _mm256_sub_pd(v1[0], v2[0]); 
tb = _mm256_sub_pd(v1[1], v2[1]);
tc = _mm256_sub_pd(v1[2], v2[2]);

tta = _mm256_mul_pd(ta,ta); //(v1.x - v2.x)^2
ttb = _mm256_mul_pd(tb,tb); //(v1.y - v2.y)^2
ttc = _mm256_mul_pd(tc,tc); //(v1.z - v2.z)^2

sum = _mm256_add_pd(_mm256_add_pd(tta,ttb), ttc); //(v1.x - v2.x)^2 + (v1.y - v2.y)^2 + (v1.z - v2.z)^2

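mask = _mm256_cmp_pd(sum, comp, _CMP_LE_OS); // all-ones bits (a NaN when read as a double) where sum <= comp, else 0.0

value = _mm256_div_pd(_mm256_min_pd(mask, comp), comp); // min turns NaN lanes into 4.0; dividing by 4.0 then gives 1.0 or 0.0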

The _mm256_cmp_pd produces an all-ones mask (a NaN when read as a double) in the lanes where the distance is at most 4.0 and 0.0 in the other lanes. The _mm256_min_pd against 4.0 (copied from the linked StackOverflow post) turns the NaN lanes into 4.0, and then I divide by 4.0 to set them to 1.0. This seems like a pretty hacky way to get what I want; is there an easier way to compare "sum" and "comp" to get 1s and 0s directly?

If you only have 2 possible values per element, you can avoid using a slow `div`. Use `_mm256_and_pd` with a vector of `1.0` to zero it or not. `cmp_pd` produces an all-0 or all-1 result so you can use it as an AND mask. — Peter Cordes, May 26 '20 at 19:33
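The AND-mask suggestion in the comment above could look roughly like the sketch below. The function name `threshold_flags_avx`, the array-based interface, and the requirement that `n` be a multiple of 4 are assumptions made for illustration; none of them come from the question. The `__attribute__((target("avx")))` annotation is a GCC/Clang extension used here so the snippet compiles without `-mavx`; the caller is still responsible for running it only on a CPU with AVX.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper: out[i] = 1.0 if dist2[i] <= threshold, else 0.0.
   Assumes n is a multiple of 4 (no tail handling in this sketch). */
__attribute__((target("avx")))
void threshold_flags_avx(const double *dist2, double threshold,
                         double *out, size_t n)
{
    assert(n % 4 == 0);
    const __m256d limit = _mm256_set1_pd(threshold);
    const __m256d one   = _mm256_set1_pd(1.0);

    for (size_t i = 0; i < n; i += 4) {
        __m256d d = _mm256_loadu_pd(dist2 + i);
        /* Each lane becomes all-ones bits where d <= limit, all-zero bits otherwise. */
        __m256d mask = _mm256_cmp_pd(d, limit, _CMP_LE_OS);
        /* ANDing with 1.0 keeps the bit pattern of 1.0 in true lanes and gives 0.0 in false lanes. */
        _mm256_storeu_pd(out + i, _mm256_and_pd(mask, one));
    }
}
```

This replaces the `min` + `div` pair with a single bitwise AND, which is the point of the comment: the compare result is already a usable mask.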
Quick, related follow-up @PeterCordes: if I wanted to do the same operation with AVX512, what would be the correct approach? Since _mm512_cmp_pd_mask returns a mask register, I've been a little stuck on how to convert that to a vector of 1's and 0's. — brokenseas, May 26 '20 at 20:45
Ideally just use the mask for zero-masking or merge-masking the next operation you do on the result. Or you could use the mask with a blend (https://www.felixcloutier.com/x86/vblendmpd:vblendmps) or a zero-masked move (`_mm512_maskz_mov_pd` / https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX_512&expand=3585,3826&text=_mm512_maskz_mov) to actually materialize a vector of 0.0 / 1.0. — Peter Cordes, May 26 '20 at 20:48
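As a rough illustration of the zero-masked move mentioned above, the sketch below materializes a vector of 0.0 / 1.0 from the `__mmask8` returned by `_mm512_cmp_pd_mask`. The function names, the array-based interface, the multiple-of-8 length requirement, and the scalar fallback are all assumptions added for this example. `__attribute__((target(...)))` and `__builtin_cpu_supports` are GCC/Clang extensions, used here so the AVX-512 path only runs on hardware that supports it.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical AVX-512 kernel: out[i] = 1.0 if dist2[i] <= threshold, else 0.0. */
__attribute__((target("avx512f")))
static void threshold_flags_avx512(const double *dist2, double threshold,
                                   double *out, size_t n)
{
    const __m512d limit = _mm512_set1_pd(threshold);
    const __m512d one   = _mm512_set1_pd(1.0);

    for (size_t i = 0; i < n; i += 8) {
        __m512d d = _mm512_loadu_pd(dist2 + i);
        /* One bit per lane, set where d <= limit. */
        __mmask8 k = _mm512_cmp_pd_mask(d, limit, _CMP_LE_OS);
        /* Zero-masked move: copy 1.0 into lanes whose bit is set, 0.0 elsewhere. */
        _mm512_storeu_pd(out + i, _mm512_maskz_mov_pd(k, one));
    }
}

/* Hypothetical dispatcher: uses AVX-512 when available, plain C otherwise.
   Assumes n is a multiple of 8 (no tail handling in this sketch). */
void threshold_flags(const double *dist2, double threshold,
                     double *out, size_t n)
{
    assert(n % 8 == 0);
    if (__builtin_cpu_supports("avx512f")) {
        threshold_flags_avx512(dist2, threshold, out, n);
        return;
    }
    for (size_t i = 0; i < n; ++i)
        out[i] = (dist2[i] <= threshold) ? 1.0 : 0.0;
}
```

If the 0.0 / 1.0 vector is only an intermediate, the mask `k` can instead be passed straight to the next masked operation, as the comment recommends, and the move disappears.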
Great! Thank you again for your helpful comments. I really appreciate it. — brokenseas, May 26 '20 at 21:18
Also, if you have AVX512 you also have FMA instead of needing separate mul / add for the last 2 products. AVX often comes with FMA, but there are some AVX CPUs without FMA (SnB / Ivy Bridge, first-gen Bulldozer, and one Via Nano even has AVX2 without FMA). — Peter Cordes, May 26 '20 at 21:21
Some compilers will contract mul+add into FMA for you, some won't (especially without -ffast-math) — Peter Cordes, May 26 '20 at 21:31
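To make the FMA remark concrete, here is a minimal sketch of the sum of squared differences with the last two products folded into `_mm256_fmadd_pd`. The function name `squared_distances_fma`, the structure-of-arrays layout (separate x, y, z arrays), and the multiple-of-4 length requirement are assumptions for illustration. The `target("avx,fma")` attribute is a GCC/Clang extension, and the caller must make sure the CPU actually supports FMA.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper: out[i] = (x1[i]-x2[i])^2 + (y1[i]-y2[i])^2 + (z1[i]-z2[i])^2.
   Assumes n is a multiple of 4 (no tail handling in this sketch). */
__attribute__((target("avx,fma")))
void squared_distances_fma(const double *x1, const double *y1, const double *z1,
                           const double *x2, const double *y2, const double *z2,
                           double *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m256d dx = _mm256_sub_pd(_mm256_loadu_pd(x1 + i), _mm256_loadu_pd(x2 + i));
        __m256d dy = _mm256_sub_pd(_mm256_loadu_pd(y1 + i), _mm256_loadu_pd(y2 + i));
        __m256d dz = _mm256_sub_pd(_mm256_loadu_pd(z1 + i), _mm256_loadu_pd(z2 + i));

        __m256d sum = _mm256_mul_pd(dx, dx);   /* dx*dx            */
        sum = _mm256_fmadd_pd(dy, dy, sum);    /* dy*dy + sum      */
        sum = _mm256_fmadd_pd(dz, dz, sum);    /* dz*dz + sum      */

        _mm256_storeu_pd(out + i, sum);
    }
}
```

Because an FMA rounds once instead of twice, the result can differ in the last bit from the separate mul / add version, which is one reason compilers may not contract the operations on their own.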
