Practical approach to do vectorization on different CPUs

Question

Compilers like LLVM can now do sophisticated auto-vectorization for us, so we no longer always need to hand-write SIMD intrinsics ourselves.

But there is another problem: as soon as we distribute the software (generally as a binary or a library), we have to consider the CPU architecture (and not only the architecture but also the microarchitecture, since some old x64 CPUs don't support AVX). This means users may need to choose the correct build for their specific CPU model to get the expected performance.

I'd like to know if there is a practical approach to vectorization across different CPUs, covering both coding and distribution.

Use automatic dispatching. Either use the built-in multiversioning feature of the stdlib/compiler, or dispatch to the correct version of the functions yourself at runtime. See [Building backward compatible binaries with newer CPU instructions support](https://stackoverflow.com/q/61005492/995714) — phuclv, Aug 11 '21 at 05:35
Terminology: Supporting AVX or not *is* an architectural difference. It's not just an implementation / internal-state detail; the existence of YMM registers (and VEX encodings of SIMD instructions to use them) is software-visible and thus architectural. New microarchitectures can introduce new architectural features / extensions. — Peter Cordes, Aug 11 '21 at 08:20
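
For illustration, the "dispatch yourself at runtime" option from the first comment might look roughly like the sketch below. It assumes GCC or Clang on x86, where `__attribute__((target("avx2")))` and `__builtin_cpu_supports` are available; on other compilers or architectures it simply falls back to the baseline loop. The names `add_baseline`, `add_avx2`, `select_add`, and `add_arrays` are hypothetical and not taken from any library. Whether the AVX2 variant is actually vectorized still depends on the optimization level used for the build.

```c
#include <stddef.h>

/* Baseline variant: built with the default target flags, runs on any CPU
 * the binary as a whole supports. */
static void add_baseline(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
/* Same source loop, but the compiler is allowed to use AVX2 inside this
 * one function. It must only be called after a CPU feature check. */
__attribute__((target("avx2")))
static void add_avx2(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
#define HAVE_AVX2_VARIANT 1
#endif

typedef void (*add_fn)(float *, const float *, const float *, size_t);

/* Choose an implementation based on what the running CPU reports. */
static add_fn select_add(void) {
#ifdef HAVE_AVX2_VARIANT
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return add_avx2;
#endif
    return add_baseline;
}

/* Public entry point (hypothetical name): resolves the variant on first
 * use and reuses it afterwards. This lazy initialization is not
 * synchronized; a multithreaded program may want to resolve it once at
 * startup instead. */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    static add_fn impl = NULL;
    if (impl == NULL)
        impl = select_add();
    impl(dst, a, b, n);
}
```

Callers only ever see `add_arrays(dst, a, b, n)`, so a single binary can be shipped and the CPU-specific choice happens on the user's machine rather than at download time.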


0 Answers