
Most of the hardware I use supports SSE2 these days. On Windows and Linux, I have some code to test for SSE support. I read somewhere that macOS has supported SSE for a long time, but I don't know the minimum version that I can enable. The final binary will be copied to other macOS machines, so I cannot use -march=native as I would with GCC.

If it is enabled by default on all builds, do I have to pass the -msse or -msse2 flags when building my code?

Here is my compiler version:

Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix

Here is the output of uname -a

uname -a
Darwin mme.local 14.1.0 Darwin Kernel Version 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64

Here is the output of sysctl machdep.cpu.features

machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64 MON DSCPL VMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 POPCNT
Shepmaster
rkm
  • If you're using Xcode then see the `CLANG_X86_VECTOR_INSTRUCTIONS` build setting. I think you can safely assume SSE3 (or even SSSE3) minimum on anything that is new enough to run OS X (i.e. anything from the last 10-15 years). – Paul R Aug 28 '17 at 11:10
  • @PaulR Thanks for the information. I am not using an Xcode project. From your comment, I can assume SSE2. – rkm Aug 28 '17 at 12:47

1 Answer


SSE2 is enabled by default for x86-64, because it's a required part of the x86-64 ISA.

Since Apple has never sold any AMD or Pentium4 CPUs, x86-64 on OS X also implies SSSE3 (first-gen Core2). The first x86 Macs were Core (not Core2), but they were 32-bit only. You unfortunately can't assume SSE4.1 or -mpopcnt.

I'd suggest -march=core2 -mtune=haswell. (-mtune doesn't affect compatibility, and Haswell tuning shouldn't be bad for actual Core2 or Nehalem hardware. See http://agner.org/optimize/ and the links in the tag wiki for microarchitecture details about what is fast or slow in (compiler-generated) assembly language on different CPUs.)

(See How does mtune actually work? for an example of different tuning causing different instruction selection without changing the required ISA extensions.)

-march=core2 enables everything that core2 supports, not just SSSE3. Since you don't care about your code performing well on AMD CPUs (because it's OS X), you can tune for an Intel CPU. There's also -mtune=intel which is more generic, but Haswell should be reasonable.
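To see what a given set of flags actually turned on, you can check the compiler's predefined macros. Here is a minimal sketch, assuming GCC or clang (both define `__SSE2__`, `__SSSE3__`, `__SSE4_1__`, `__POPCNT__` and `__AVX2__` when the matching extension is enabled). The enum and function names are made up for illustration. Note that this reports what the compiler was allowed to use, not what the CPU running the binary supports.

```c
#include <stdio.h>

/* Bit flags for ISA extensions enabled at compile time.
   These names are hypothetical, chosen only for this example. */
enum {
    ISA_SSE2   = 1 << 0,
    ISA_SSSE3  = 1 << 1,
    ISA_SSE4_1 = 1 << 2,
    ISA_POPCNT = 1 << 3,
    ISA_AVX2   = 1 << 4
};

/* Returns the extensions enabled by -march / -m flags or by the
   target's defaults. This is a compile-time property of the build. */
unsigned compiled_isa(void)
{
    unsigned mask = 0;
#ifdef __SSE2__
    mask |= ISA_SSE2;
#endif
#ifdef __SSSE3__
    mask |= ISA_SSSE3;
#endif
#ifdef __SSE4_1__
    mask |= ISA_SSE4_1;
#endif
#ifdef __POPCNT__
    mask |= ISA_POPCNT;
#endif
#ifdef __AVX2__
    mask |= ISA_AVX2;
#endif
    return mask;
}

/* Prints one line per extension so the build baseline is easy to eyeball. */
void print_compiled_isa(void)
{
    unsigned m = compiled_isa();
    printf("SSE2:   %s\n", (m & ISA_SSE2)   ? "yes" : "no");
    printf("SSSE3:  %s\n", (m & ISA_SSSE3)  ? "yes" : "no");
    printf("SSE4.1: %s\n", (m & ISA_SSE4_1) ? "yes" : "no");
    printf("POPCNT: %s\n", (m & ISA_POPCNT) ? "yes" : "no");
    printf("AVX2:   %s\n", (m & ISA_AVX2)   ? "yes" : "no");
}
```

With a plain x86-64 build I'd expect only SSE2 to be reported, and with -march=core2 I'd expect SSSE3 as well, but check on your own toolchain.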

You might be missing out on support for Hackintosh systems where someone installed OS X on an ancient CPU on non-Apple hardware, but IDK if OS X would work on an AMD Athlon64 / PhenomII, or Intel P4.

It would be nice to be able to enable some Nehalem stuff like -mpopcnt, but first- and second-gen Core 2 (Conroe and Penryn) lacked that. Even SSE4.1 isn't available on first-gen Core 2.
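If you want popcnt anyway, one option is to keep the core2 baseline and check for it at run time. The following is only a sketch, assuming a GCC or clang recent enough to provide `<cpuid.h>` and the `target` function attribute (I have not checked how far back Apple's clang supports the attribute). The function names are hypothetical.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Portable fallback: clears the lowest set bit on each iteration. */
static int popcount_portable(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

#if defined(__x86_64__) || defined(__i386__)
/* Only this function is compiled with popcnt enabled, so the rest of
   the file keeps the baseline chosen on the command line. */
__attribute__((target("popcnt")))
static int popcount_hw(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* CPUID leaf 1 reports POPCNT in ECX bit 23. */
static int cpu_has_popcnt(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 23) & 1;
}
#endif

/* Dispatches at run time. A real program would cache the CPUID result
   instead of querying it on every call. */
int popcount64(uint64_t x)
{
#if defined(__x86_64__) || defined(__i386__)
    if (cpu_has_popcnt())
        return popcount_hw(x);
#endif
    return popcount_portable(x);
}
```

For a single instruction like this the check costs more than it saves unless you hoist it out of the hot loop, so dispatch at the level of whole functions or loops.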


It's also possible to build a fat binary with baseline and Haswell slices, x86_64 and x86_64h. Stephen Canon says (in comments below) that "the x86_64h slice will run automatically on Haswell and later µarches". (Slices for other uarches aren't currently an option, but most programs would get little benefit.)

Your x86_64 (non-Haswell) slice should probably build with -march=core2 -mtune=sandybridge.
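With two slices built from the same source, compile-time checks are enough to pick a code path per slice. This is a sketch under the assumption that the x86_64h slice is built with FMA and AVX enabled (so the compiler defines `__FMA__` and `__AVX__`), which matches what Stephen Canon describes but is worth verifying on your toolchain. The function name `dot` is just an example.

```c
#include <stddef.h>
#if defined(__FMA__) && defined(__AVX__)
#include <immintrin.h>
#endif

/* Dot product of two arrays of n doubles. The same source is meant to
   be compiled once per slice; each slice gets the branch that matches
   the ISA extensions enabled for it. */
double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
#if defined(__FMA__) && defined(__AVX__)
    /* Haswell-style build: four doubles per fused multiply-add. */
    __m256d acc = _mm256_setzero_pd();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),
                              _mm256_loadu_pd(b + i), acc);
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];
#else
    /* Baseline build: plain scalar loop, left to the auto-vectorizer. */
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
#endif
    return sum;
}
```

The two branches sum in a different order and the FMA path skips one rounding step per element, so results can differ in the last bits for general floating-point input.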

Haswell introduced AVX2, FMA, and BMI2, so -march=haswell is a very nice baseline for Broadwell / Skylake / Kaby Lake / Coffee Lake. (This matters for tuning options as well as ISA extensions: gcc -march=haswell disables -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, while -mavx with tune=default or sandybridge enables them. Splitting sucks on Haswell, especially when it creates shuffle-port bottlenecks. And it's really dumb when your data is almost always aligned, or really always aligned but you just didn't tell the compiler about it.)
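If your data really is always aligned, you can tell the compiler. A minimal sketch using `__builtin_assume_aligned`, which both GCC and clang provide. The function name and the 32-byte figure are assumptions for the example; whether the compiler then emits aligned or unsplit loads is still up to its tuning settings.

```c
#include <stddef.h>

/* Scales n floats in place by k. The caller promises that p is 32-byte
   aligned; passing a misaligned pointer is undefined behavior. */
void scale_aligned(float *p, size_t n, float k)
{
    /* The builtin returns its argument and lets the optimizer assume
       the stated alignment when it vectorizes the loop below. */
    float *a = __builtin_assume_aligned(p, 32);
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}
```

A caller would typically get such a buffer from `posix_memalign` or declare it with `_Alignas(32)`.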

Broadwell introduced ADOX/ADCX which is pretty niche (run two extended-precision add dependency chains in parallel), and Skylake introduced clflushopt which isn't widely useful.
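For context, an extended-precision add dependency chain looks like the loop below: each limb's add needs the carry from the previous one. This sketch uses the `_addcarry_u64` intrinsic from `<immintrin.h>` on x86-64, with a plain C fallback elsewhere. The function name `add_limbs` is hypothetical, and whether the compiler emits adc, adcx or something else is its own decision. ADOX/ADCX only help when two such chains are interleaved, which usually means hand-written asm.

```c
#include <stddef.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Computes r = a + b over n 64-bit limbs, least significant limb first.
   Returns the carry out of the most significant limb (0 or 1). */
unsigned char add_limbs(unsigned long long *r,
                        const unsigned long long *a,
                        const unsigned long long *b, size_t n)
{
    unsigned char carry = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__x86_64__)
        /* Adds a[i] + b[i] + carry, stores the low 64 bits in r[i],
           and returns the new carry. */
        carry = _addcarry_u64(carry, a[i], b[i], &r[i]);
#else
        /* Portable version: detect wraparound after each partial add. */
        unsigned long long s = a[i] + carry;
        unsigned char c1 = s < a[i];
        r[i] = s + b[i];
        carry = c1 | (r[i] < s);
#endif
    }
    return carry;
}
```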

Skylake and most Broadwell CPUs do have working transactional memory, though, which might be important for some fine-grained multithreading cases. (Haswell was going to have it, but it was disabled in a microcode update after a rare bug was discovered in the implementation.)

AVX512 is the next big thing that's widely useful but Haswell doesn't have, so maybe Apple will add support for a Cannonlake or Ice Lake slice at some point.

I wouldn't recommend making a separate build for Broadwell or Skylake (with any dispatching mechanism), unless you know you can take advantage of a specific new feature and it makes a significant difference.

But a separate build could potentially be useful for Sandybridge: for AVX support without AVX2 (especially for 256-bit FP math, but also to save movdqa instructions in 128-bit integer vector code), and for SSE4.x and popcnt. Sandybridge also has no partial-flag problems in an extended-precision adc loop using dec/jnz.

Peter Cordes
  • SSSE3 is the correct baseline for 64-bit processes targeting an arbitrary macOS release. If targeting macOS 10.12 or later, you can assume SSE4.1 as well, since 10.11.x was the last OS version to support pre-Penryn hardware. – Stephen Canon Aug 28 '17 at 14:33
  • I should also note that it's possible to build a fat macOS binary with `x86_64` and `x86_64h` slices; the `x86_64h` slice will run automatically on Haswell and later µarches, and implies most of the HSW new instructions (FMA, AVX2, BMI, some others). – Stephen Canon Aug 28 '17 at 14:37
  • @StephenCanon: Nice! Haswell introduced a lot of good stuff; great to have that as a baseline (especially BMI2, which is most useful when you can just let the compiler use it everywhere). – Peter Cordes Aug 28 '17 at 15:01
  • @StephenCanon Quick question: is there something similar for Broadwell and Skylake (e.g. "x86_64b", "x86_64s"), or is this Haswell-only? – saagarjha May 16 '18 at 00:19
  • @saagarjha: IDK, but it's unlikely that you'll get any benefit from `-march=broadwell` or `skylake` vs. `-march=haswell`, unless you have some hand-written asm using ADOX/ADCX, or TSX / RTM. – Peter Cordes May 16 '18 at 01:43
  • @saagarjha No, only for Haswell. Adding new slices is relatively expensive, so it doesn't happen for every new uArch, but Haswell had a lot of significant ISA improvements. – Stephen Canon May 16 '18 at 18:42
  • What does `mtune` do? I usually use `march`. What's the difference? – Z boson Oct 31 '18 at 08:21
  • @Zboson: `-mtune` tunes without changing the baseline target. e.g. `rep ret` is needed in some cases on AMD Phenom CPUs, but not on Intel, so `-mtune=haswell` drops that. Vectorizing with 256-bit vectors isn't always worth it, so `-mtune=bdver1` (first-gen Bulldozer) will (I think/hope) prefer 128-bit vectors even if `-mavx` is specified. `-mtune=` anything not ancient will also optimize for cmp/jcc macro-fusion, instead of trying to do the `cmp` early so flags are ready. See also [How does mtune actually work?](https://stackoverflow.com/q/44490331) – Peter Cordes Oct 31 '18 at 08:27
  • What are `march` and `mtune` flags? What do they do? There are other flags starting with `m`, like `mllvm`. Are they somehow related? And what is `haswell`? Godbolt compiler seems to use that too. – Nawaz Jul 24 '20 at 14:00
  • @Nawaz: clang uses the same options as GCC, GCC's manual documents it better: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html. Haswell is an Intel microarchitecture: https://en.wikipedia.org/wiki/Haswell_(microarchitecture) used in i3/5/7 4xxx. Later CPUs (like current i7 9xxx) are compatible with it. Other `-m` options generally set things about the target machine to compile for. https://clang.llvm.org/docs/ClangCommandLineReference.html – Peter Cordes Jul 24 '20 at 14:05
  • @PeterCordes: That's real quick response. Thanks a lot. I also found another answer: https://stackoverflow.com/a/23267520/415784 and the links in your answer also helped. Thanks a lot. :-) – Nawaz Jul 24 '20 at 14:06