Intel has never removed instruction-set extensions in later generations of the same CPU line, i.e. a binary that works on an old Intel CPU always works on a newer Intel CPU.
(The one exception to this is first-gen Xeon Phi: Knights Corner used an incompatible precursor of AVX512 called IMCI, but later Xeon Phi accelerator cards / computers use AVX512.)
If you must use the same binary on all CPUs, use gcc -march=sandybridge -mtune=haswell, and make sure your important arrays are aligned by 32 bytes.
It may be worth benchmarking with gcc -march=sandybridge (i.e. with tune=sandybridge) as well, to see which works better for your code. -mprefer-avx128 (or -mprefer-vector-width=128 in newer gcc versions) might be interesting to try: some loops get messy when gcc auto-vectorizes with 256-bit vectors.
SnB/IvB have inefficient misaligned AVX loads/stores, so tune=sandybridge sets -mavx256-split-unaligned-load, which sucks a lot if your data is aligned at runtime but the compiler didn't know that. The extra instructions and shuffles aren't helpful on Haswell, so -mtune=haswell includes -mno-avx256-split-unaligned-load.
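As a rough sketch of what "aligned by 32 bytes" plus "the compiler knows it" can look like in C: the helper names alloc_aligned_floats and add_arrays are made up for illustration, aligned_alloc is C11, and __builtin_assume_aligned is a GCC/clang builtin. The promise only holds if the pointers really are 32-byte aligned.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for n floats on a 32-byte boundary.
 * C11 aligned_alloc wants the size to be a multiple of the alignment,
 * so round the byte count up to the next multiple of 32. */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = (n * sizeof(float) + 31) & ~(size_t)31;
    return aligned_alloc(32, bytes);   /* release with free() */
}

/* Hypothetical kernel: dst[i] = a[i] + b[i].
 * __builtin_assume_aligned promises the compiler that the pointers are
 * 32-byte aligned, so the auto-vectorizer does not have to treat the
 * accesses as possibly misaligned. Passing a misaligned pointer here
 * is undefined behavior. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    float *d = __builtin_assume_aligned(dst, 32);
    const float *x = __builtin_assume_aligned(a, 32);
    const float *y = __builtin_assume_aligned(b, 32);
    for (size_t i = 0; i < n; i++)
        d[i] = x[i] + y[i];
}
```

For static or stack arrays, C11 alignas(32) (from <stdalign.h>) gives the same guarantee without a heap allocation. Whether this changes the generated code much depends on the gcc version and tuning options, so check the asm or benchmark.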
Unfortunately gcc doesn't have a "tune=avx2" option to tune for all CPUs which have AVX2, or an option to tune for the average CPU which supports the instruction sets you enabled (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568). Your only choices are to tune for a specific CPU, tune for the generic baseline, or use specific tuning options.
GCC does have some support for runtime dispatch with ifunc. You have to activate it in the source for specific functions. See https://lwn.net/Articles/691932/ for more about function multi-versioning.
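A minimal sketch of what activating it in the source can look like, using GCC's target_clones attribute. The function scale_add and the choice of targets are illustrative assumptions, not something from your code, and this assumes x86-64 Linux with glibc (for ifunc support) and a GCC new enough to accept target_clones in C.

```c
#include <stddef.h>

/* Hypothetical hot loop: dst[i] += k * src[i].
 * With target_clones, GCC compiles one copy of the function per listed
 * target (here: Haswell, Sandybridge, and a baseline "default" version)
 * and emits an ifunc resolver that selects one of them when the program
 * is loaded, based on the CPU it is running on. */
__attribute__((target_clones("arch=haswell", "arch=sandybridge", "default")))
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Callers just call scale_add() normally. The file itself can be built with plain -O3 and no -march, because each clone gets its own target options. The dispatch only happens per function, so this is best kept to a few big hot functions rather than tiny helpers that would otherwise inline.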
Best option: build separate binaries for SnB / Haswell, and dispatch with a script or $PATH setting
On each cluster node, create a /etc/node-type or whatever, which contains sandybridge or haswell or whatever. Any per-node filesystem is fine, or re-detect it at run time with gcc or something cheaper. In your job script:
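#!/bin/sh
# /etc/node-type holds one word, e.g. sandybridge or haswell.
# Use cat here: $(<file) is a bash/ksh extension that a plain /bin/sh may not support.
bin_dir="./bin-$(cat /etc/node-type)"
exec "$bin_dir/my_prog" "$@"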
Create symlinks as necessary to make bin-skylake and bin-broadwell use the Haswell binaries.
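If you'd rather detect the node type than maintain the file by hand, something like this could generate it. This is only a sketch: node_type and print_node_type are hypothetical names, the mapping from CPU features to directory suffixes is my assumption about how you'd want to group CPUs, and __builtin_cpu_supports is a GCC/clang builtin for x86.

```c
#include <stdio.h>

/* Hypothetical helper: map CPU features to a bin-* directory suffix.
 * AVX2 + FMA is treated as "haswell" (this also matches later CPUs),
 * plain AVX as "sandybridge", anything else as "generic". */
const char *node_type(void)
{
    __builtin_cpu_init();   /* safe to call more than once */
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return "haswell";
    if (__builtin_cpu_supports("avx"))
        return "sandybridge";
    return "generic";
}

/* Write the detected type plus a newline to the given stream.
 * Returns 0 on success, -1 on a write error. */
int print_node_type(FILE *out)
{
    return fprintf(out, "%s\n", node_type()) < 0 ? -1 : 0;
}
```

Call print_node_type(stdout) from a tiny main() and redirect the output to the per-node file once per node, or run it from the job script directly. With this grouping you don't need the Skylake / Broadwell symlinks, but the "generic" case would need its own bin-generic directory if such nodes exist.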
Haswell introduced AVX2 and FMA, and BMI1/2. If you're number-crunching, you really want FMA. BDW/SKL didn't introduce any significant ISA extensions that compilers can use to make your code run faster. Tuning for BDW/SKL isn't significantly different from tuning for Haswell either.
If you have any Skylake-avx512 CPUs, that's different.