I'm not aware of any trick for doing this in 2 or fewer instructions. (The SSE version of this question doesn't have anything better either: Compute the absolute difference between unsigned integers using SSE. It does mention the saturating method I used in this answer.)
Slightly better on pre-Skylake: subtract both ways with unsigned saturation, then OR the results. (Either a-b or b-a saturates to zero for each element.)
_mm256_or_si256(_mm256_subs_epu8(a,b), _mm256_subs_epu8(b,a))
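To see why the OR works, here is a scalar model of one byte lane: a minimal sketch, not the vector code itself. The helper names (subs_u8, absdiff_u8_model, absdiff_model_mismatches, absdiff_model_self_check) are made up for illustration; subs_u8 only mimics what a single lane of _mm256_subs_epu8 does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one byte lane of an unsigned saturating subtract.
   The result clamps to 0 instead of wrapping around when y > x. */
static uint8_t subs_u8(uint8_t x, uint8_t y)
{
    return x > y ? (uint8_t)(x - y) : 0;
}

/* Hypothetical helper: per-lane model of
   _mm256_or_si256(_mm256_subs_epu8(a,b), _mm256_subs_epu8(b,a)).
   At most one of the two differences is non-zero, so OR selects it. */
static uint8_t absdiff_u8_model(uint8_t a, uint8_t b)
{
    return (uint8_t)(subs_u8(a, b) | subs_u8(b, a));
}

/* Compare the model against a plain |a - b| for all 65536 byte pairs
   and return how many pairs disagree. */
static int absdiff_model_mismatches(void)
{
    int mismatches = 0;
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            int expected = a > b ? a - b : b - a;
            if (absdiff_u8_model((uint8_t)a, (uint8_t)b) != expected)
                mismatches++;
        }
    }
    return mismatches;
}

/* Aborts if the identity fails for any pair of byte values. */
static void absdiff_model_self_check(void)
{
    assert(absdiff_model_mismatches() == 0);
}
```

Calling absdiff_model_self_check() once (for example from a test harness) checks the identity exhaustively for 8-bit elements. It says nothing about port usage or latency, which is what the rest of this answer is about.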
On Haswell, pmin/pmax and psub only run on port 1 or port 5, but por can run on any of the three vector execution ports (0, 1, 5).
Skylake adds a 3rd vector-integer adder so there's no difference on that uarch. (See http://agner.org/optimize/ and other links in the x86 tag wiki, including Intel's optimization manual.)
This is also slightly better on Ryzen, where VPOR can run on any of P0123, but PADD/PMIN can only run on P013 according to Agner Fog's testing. (Ryzen splits 256b vector ops into 2 uops, but it has the throughput for that to be useful. It can't fill its 6-uop wide pipe using only single-uop instructions.)
Uops that can run on more ports are less likely to be delayed waiting for their assigned port (resource conflict), so you're more likely to actually get 2 cycle total latency with this (from both inputs being ready to the output being ready). They're also less likely to contribute to a throughput bottleneck if there's competition for a specific port (like port 5 which has the only shuffle unit on Intel Haswell and later).