3

I'm new to AVX (came from ARM NEON), and was unpleasantly surprised that AVX lacks many U8 arithmetics, absolute difference being among them missing.

Hence I had to resort to max(a,b)-min(a,b) with the inline function:

static inline __m256i _mm256_abd_epu8(__m256i a, __m256i b)
{
    return _mm256_sub_epi8(_mm256_max_epu8(a, b), _mm256_min_epu8(a, b));
}

I'm curious if there are more efficient ways dealing with this problem.

And yes, I'm aware of _mm256_sad_epu8, but I need the differences themselves, NOT the sum.

I'd appreciate any input, and it's ok with AVX2, disregarding any backward compatibility.

Thanks in advance.

Jake 'Alquimista' LEE
  • 6,197
  • 2
  • 17
  • 25

1 Answers1

5

I'm not aware of any trick for doing this with only 2 or fewer instructions. (And the SSE version of this question doesn't have anything better either: Compute the absolute difference between unsigned integers using SSE). It does mention the saturating method I used in this answer.


Slightly better on pre-Skylake: subtract both ways with unsigned saturation, then OR the results. (Either a-b or b-a saturates to zero for each element.)

_mm256_or_si256(_mm256_subs_epu8(a,b), _mm256_subs_epu8(b,a))

On Haswell, pmin/pmax and psub only run on port 1 or port 5, but por can run on any of the three vector execution ports (0, 1, 5).

Skylake adds a 3rd vector-integer adder so there's no difference on that uarch. (See http://agner.org/optimize/ and other links in the tag wiki, including Intel's optimization manual.)

This is also slightly better on Ryzen, where VPOR can run on any of P0123, but PADD/PMIN can only run on P013 according to Agner Fog's testing. (Ryzen splits 256b vector ops into 2 uops, but it has the throughput for that to be useful. It can't fill its 6-uop wide pipe using only single-uop instructions.)

Uops that can run on more ports are less likely to be delayed waiting for their assigned port (resource conflict), so you're more likely to actually get 2 cycle total latency with this (from both inputs being ready to the output being ready). They're also less likely to contribute to a throughput bottleneck if there's competition for a specific port (like port 5 which has the only shuffle unit on Intel Haswell and later).

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Wow, I cannot even imagine a more comprehensive answer. Exactly what I needed. Thank you. – Jake 'Alquimista' LEE Oct 27 '17 at 08:15
  • @Jake'Alquimista'LEE: cheers, I thought you'd appreciate the microarchitectural stuff, so I went into more detail than I otherwise might have. – Peter Cordes Oct 27 '17 at 08:17
  • Absolutely!!!! Now I realize that I have to dig deeper into pipelines, and the link you gave is godsend. – Jake 'Alquimista' LEE Oct 27 '17 at 08:22
  • @Jake'Alquimista'LEE: Some of my other answers have some good stuff. e.g. https://stackoverflow.com/questions/45113527/why-does-mulss-take-only-3-cycles-on-haswell-different-from-agners-instruction. Also some of my highest-voted answers, especially https://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466 – Peter Cordes Oct 27 '17 at 08:28