Say compilers like LLVM can now do fairly sophisticated auto-vectorization for us, so we don't always need to hand-write SIMD intrinsics ourselves.
But there's another problem: as soon as we distribute the software (generally as a binary or a library), we have to consider the target CPU, and not only its architecture but also its microarchitecture, since some older x64 CPUs don't support AVX. This means users may need to pick the build that matches their specific CPU model to get the expected performance.
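To make the concern concrete, here is a minimal sketch of the kind of check that someone (the user, an installer, or the program itself) would have to perform. It assumes GCC or Clang, whose `__builtin_cpu_supports` builtin queries CPU features on x86 at run time. The function name `pick_build` and the variant names it returns are hypothetical and only for illustration.

```c
/* Hypothetical helper: report which build variant the current CPU can run.
   Assumes GCC or Clang; on other compilers or non-x86 targets it simply
   falls back to "baseline". */
const char *pick_build(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                 /* populate the CPU feature data */
    if (__builtin_cpu_supports("avx2"))
        return "avx2";                    /* CPU reports AVX2 support */
    if (__builtin_cpu_supports("avx"))
        return "avx";                     /* AVX but not AVX2 */
#endif
    return "baseline";                    /* no AVX detected, or unknown target */
}
```

A launcher could call `pick_build()` and load the matching library, but this still means building and shipping several variants of the same code.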
I'd like to know whether there is a practical approach to vectorization across different CPUs, covering both how to write the code and how to distribute it.