
I'm pretty new to AVX (and C!) and I'm trying to calculate the squared Euclidean distance between two vectors and return a vector filled with 1.0 where the distance is at most some threshold and 0.0 where it is greater.

For instance, if the distances are [5.0, 6.0, 2.0, 1.0] and the threshold is 4.0, I would like the function to return [0.0, 0.0, 1.0, 1.0]. The code below is what I have so far (adapted a little from "AVX2 float compare and get 0.0 or 1.0 instead of all-0 or all-one bits"). It works but leaves a lot to be desired.

__m256d diff, value, ta, tb, tc, tta, ttb, ttc, sum, comp, mask;
comp = _mm256_set1_pd(4.0); //this is what I want to compare the distance to

ta = _mm256_sub_pd(v1[0], v2[0]); 
tb = _mm256_sub_pd(v1[1], v2[1]);
tc = _mm256_sub_pd(v1[2], v2[2]);

tta = _mm256_mul_pd(ta,ta); //(v1.x - v2.x)^2
ttb = _mm256_mul_pd(tb,tb); //(v1.y - v2.y)^2
ttc = _mm256_mul_pd(tc,tc); //(v1.z - v2.z)^2

sum = _mm256_add_pd(_mm256_add_pd(tta,ttb), ttc); //(v1.x - v2.x)^2 + (v1.y - v2.y)^2 + (v1.z - v2.z)^2

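mask = _mm256_cmp_pd(sum, comp, _CMP_LE_OS); // all-ones bits (a NaN) where sum <= comp, all-zero bits (0.0) elsewhere

// min_pd returns its second operand when the first is NaN, so this gives 4.0 or 0.0; dividing by 4.0 gives 1.0 or 0.0
value = _mm256_div_pd(_mm256_min_pd(mask, comp), comp);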

For a calculated distance of [5.0, 6.0, 2.0, 1.0] and a threshold of 4.0, the _mm256_min_pd of the mask and "comp" gives [0.0, 0.0, 4.0, 4.0] (a trick copied from the linked StackOverflow post), and then I divide by 4.0 to turn each 4.0 into 1.0. This seems like a pretty hacky way to get what I want; is there an easier way to compare "sum" and "comp" to get 1's and 0's directly?

Marco Bonelli
brokenseas
  • If you only have 2 possible values per element, you can avoid using a slow `div`. Use `_mm256_and_pd` with a vector of `1.0` to zero it or not. `cmp_pd` produces an all-0 or all-1 result so you can use it as an AND mask. – Peter Cordes May 26 '20 at 19:33
  • This worked, thank you. – brokenseas May 26 '20 at 20:00
  • Quick, related follow-up @PeterCordes: if I wanted to do the same operation with AVX512, what would be the correct approach? Since _mm512_cmp_pd_mask returns a mask register, I've been a little stuck on how to convert that to a vector of 1's and 0's. – brokenseas May 26 '20 at 20:45
  • Ideally just use the mask for zero-masking or merge-masking the next operation you do on the result. Or you could use the mask with a blend (https://www.felixcloutier.com/x86/vblendmpd:vblendmps) or a zero-masked move (`_mm512_maskz_mov_pd` / https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX_512&expand=3585,3826&text=_mm512_maskz_mov) to actually materialize a vector of 0.0 / 1.0. – Peter Cordes May 26 '20 at 20:48
  • Great! Thank you again for your helpful comments. I really appreciate it. – brokenseas May 26 '20 at 21:18
  • Also, if you have AVX512 you also have FMA instead of needing separate mul / add for the last 2 products. AVX often comes with FMA, but there are some AVX CPUs without FMA (SnB / Ivy Bridge, first-gen Bulldozer, and one Via Nano even has AVX2 without FMA). – Peter Cordes May 26 '20 at 21:21
  • Some compilers will contract mul+add into FMA for you, some won't (especially without -ffast-math) – Peter Cordes May 26 '20 at 21:31
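
For anyone landing here later, this is a minimal sketch of what the comments above suggest, not a definitive implementation. It assumes GCC or Clang on x86-64 and a CPU with AVX. The names `TARGET_AVX`, `within_threshold_avx`, `within_threshold4` and `within_threshold_avx512` are made up for illustration and do not come from the question. The AVX path ANDs the compare mask with a vector of 1.0 instead of using min and div. The AVX-512 path is only compiled when `__AVX512F__` is defined, and uses FMA plus a zero-masked move as described in the comments.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GCC/Clang. The target attribute lets the AVX functions build
   even when the file is not compiled with -mavx. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Returns 1.0 in each lane whose squared distance is <= threshold_sq,
   and 0.0 in every other lane. v1[0..2] and v2[0..2] hold x, y, z. */
TARGET_AVX
static inline __m256d within_threshold_avx(const __m256d v1[3],
                                           const __m256d v2[3],
                                           double threshold_sq)
{
    __m256d dx = _mm256_sub_pd(v1[0], v2[0]);
    __m256d dy = _mm256_sub_pd(v1[1], v2[1]);
    __m256d dz = _mm256_sub_pd(v1[2], v2[2]);

    __m256d sum = _mm256_add_pd(
        _mm256_add_pd(_mm256_mul_pd(dx, dx), _mm256_mul_pd(dy, dy)),
        _mm256_mul_pd(dz, dz));

    /* All-ones bits where sum <= threshold_sq, all-zero bits elsewhere. */
    __m256d mask = _mm256_cmp_pd(sum, _mm256_set1_pd(threshold_sq), _CMP_LE_OS);

    /* AND with 1.0 keeps the bit pattern of 1.0 where the mask is set
       and yields 0.0 where it is clear. No min, no div. */
    return _mm256_and_pd(mask, _mm256_set1_pd(1.0));
}

/* Hypothetical convenience wrapper: a[c][i] is coordinate c of point i. */
TARGET_AVX
void within_threshold4(const double a[3][4], const double b[3][4],
                       double threshold_sq, double out[4])
{
    assert(threshold_sq >= 0.0); /* a squared distance is never negative */

    __m256d v1[3], v2[3];
    for (int c = 0; c < 3; ++c) {
        v1[c] = _mm256_loadu_pd(a[c]);
        v2[c] = _mm256_loadu_pd(b[c]);
    }
    _mm256_storeu_pd(out, within_threshold_avx(v1, v2, threshold_sq));
}

#ifdef __AVX512F__
/* AVX-512 variant: the compare produces a k-mask rather than a vector,
   so a zero-masked move materializes 1.0 / 0.0 from it. */
static inline __m512d within_threshold_avx512(const __m512d v1[3],
                                              const __m512d v2[3],
                                              double threshold_sq)
{
    __m512d dx = _mm512_sub_pd(v1[0], v2[0]);
    __m512d dy = _mm512_sub_pd(v1[1], v2[1]);
    __m512d dz = _mm512_sub_pd(v1[2], v2[2]);

    /* dx*dx, then two fused multiply-adds for the y and z terms. */
    __m512d sum = _mm512_fmadd_pd(
        dz, dz, _mm512_fmadd_pd(dy, dy, _mm512_mul_pd(dx, dx)));

    __mmask8 k = _mm512_cmp_pd_mask(sum, _mm512_set1_pd(threshold_sq),
                                    _CMP_LE_OS);
    return _mm512_maskz_mov_pd(k, _mm512_set1_pd(1.0));
}
#endif
```

With four points whose squared distances are 5.0, 6.0, 2.0 and 1.0 and a threshold of 4.0, `within_threshold4` writes 0.0, 0.0, 1.0, 1.0 to `out`. If the 0.0 / 1.0 vector only feeds another operation, the comments above suggest using the mask directly instead of materializing it.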
