36

I wrote some code to do a bunch of math, and it needs to go fast, so I need it to use SSE and AVX instructions. I'm compiling it using g++ with the flags -O3 and -march=native, so I think it's using SSE and AVX instructions, but I'm not sure. Most of my code looks something like the following:

for(int i = 0;i<size;i++){
    a[i] = b[i] * c[i];
}

Is there any way I can tell if my code (after compilation) uses SSE and AVX instructions? I think I could look at the assembly to see, but I don't know assembly, and I don't know how to see the assembly that the compiler outputs.

phuclv
BadProgrammer99
  • You might want to use the vector extensions too. – Jester Dec 19 '17 at 00:28
  • 4
    Get `GCC` to output assembler `g++ -S -o prog.s prog.cpp` – Galik Dec 19 '17 at 00:32
  • 3
    For looking at compiler output: https://stackoverflow.com/questions/38552116/how-to-remove-noise-from-gcc-clang-assembly-output. @Galik: Obviously you have to use `g++ -march=native -O3 -S` to get asm output with optimizations. Also note that you will see SSE instructions in scalar FP code, like `vaddsd` to add doubles. You're looking for `vmulpd` (packed double), `vmulps` (packed single), or `vpmulld` (integer packed multiply dword (32-bit elements)) or other packed-integer multiply instructions depending on the type of `b` and `c`. – Peter Cordes Dec 19 '17 at 01:41
  • That is a very common calculation. See std::inner_product. A GPU might be dozens of times faster for that. Also investigate using OMP. How big are the vectors? – Jive Dadson Dec 19 '17 at 06:55
  • @JiveDadson It's a bit more complicated than the example above because it's on a strided array that represents a tensor. The GPU would go way faster, but I know absolutely nothing about using it, so I'm going to write CPU code first. Also, I'm already using OpenMP. – BadProgrammer99 Dec 19 '17 at 07:17
  • [How to check if a binary requires SSE4 or AVX on Linux](https://superuser.com/q/726395/241386) – phuclv Jul 22 '18 at 14:50
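
The vector extensions Jester mentions can be sketched as follows. This is a minimal illustration, assuming GCC or Clang; the alias `v8sf` and the function `multiply_v8` are hypothetical names, and the arrays are assumed to be 32-byte aligned and a whole number of vectors long. Without `-mavx` or `-march=native` the compiler is free to lower each 32-byte operation to narrower instructions.

```cpp
#include <cstddef>

// GCC/Clang vector extension: eight floats treated as one 32-byte value.
// The name `v8sf` is a hypothetical alias chosen for this sketch.
typedef float v8sf __attribute__((vector_size(32)));

// Multiplies n_vec vectors element-wise; each `*` acts on all eight lanes.
// The pointers must be 32-byte aligned because v8sf carries that alignment.
void multiply_v8(v8sf* a, const v8sf* b, const v8sf* c, std::size_t n_vec) {
    for (std::size_t i = 0; i < n_vec; ++i) {
        a[i] = b[i] * c[i];
    }
}
```

Writing the loop this way makes the vector width explicit instead of relying on the auto-vectorizer, at the cost of portability to other compilers.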

5 Answers

46

Under Linux, you could also disassemble your binary:

objdump -d YOURFILE > YOURFILE.asm

Then find all SSE instructions:

awk '/[ \t](addps|addss|andnps|andps|cmpps|cmpss|comiss|cvtpi2ps|cvtps2pi|cvtsi2ss|cvtss2si|cvttps2pi|cvttss2si|divps|divss|ldmxcsr|maxps|maxss|minps|minss|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movss|movups|mulps|mulss|orps|rcpps|rcpss|rsqrtps|rsqrtss|shufps|sqrtps|sqrtss|stmxcsr|subps|subss|ucomiss|unpckhps|unpcklps|xorps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|psadbw|pshufw)[ \t]/' YOURFILE.asm

Find only packed SSE instructions (suggested by @Peter Cordes in comments):

awk '/[ \t](addps|andnps|andps|cmpps|cvtpi2ps|cvtps2pi|cvttps2pi|divps|maxps|minps|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movntq|movups|mulps|orps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|pmulhuw|psadbw|pshufw|rcpps|rsqrtps|shufps|sqrtps|subps|unpckhps|unpcklps|xorps)[ \t]/' YOURFILE.asm

Find all SSE2 instructions (except MOVSD and CMPSD, whose mnemonics are shared with string instructions first introduced in the 80386):

awk '/[ \t](addpd|addsd|andnpd|andpd|cmppd|comisd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvtsd2si|cvtsd2ss|cvtsi2sd|cvtss2sd|cvttpd2dq|cvttpd2pi|cvttps2dq|cvttsd2si|divpd|divsd|maxpd|maxsd|minpd|minsd|movapd|movhpd|movlpd|movmskpd|movupd|mulpd|mulsd|orpd|shufpd|sqrtpd|sqrtsd|subpd|subsd|ucomisd|unpckhpd|unpcklpd|xorpd|movdq2q|movdqa|movdqu|movq2dq|paddq|pmuludq|pshufhw|pshuflw|pshufd|pslldq|psrldq|punpckhqdq|punpcklqdq)[ \t]/' YOURFILE.asm

Find only packed SSE2 instructions:

awk '/[ \t](addpd|andnpd|andpd|cmppd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvttpd2dq|cvttpd2pi|cvttps2dq|divpd|maxpd|minpd|movapd|movhpd|movlpd|movmskpd|movntdq|movntpd|movupd|mulpd|orpd|pshufd|pshufhw|pshuflw|pslldq|psrldq|punpckhqdq|shufpd|sqrtpd|subpd|unpckhpd|unpcklpd|xorpd)[ \t]/' YOURFILE.asm

Find all SSE3 instructions:

awk '/[ \t](addsubpd|addsubps|haddpd|haddps|hsubpd|hsubps|movddup|movshdup|movsldup|lddqu|fisttp)[ \t]/' YOURFILE.asm

Find all SSSE3 instructions:

awk '/[ \t](psignw|psignd|psignb|pshufb|pmulhrsw|pmaddubsw|phsubw|phsubsw|phsubd|phaddw|phaddsw|phaddd|palignr|pabsw|pabsd|pabsb)[ \t]/' YOURFILE.asm

Find all SSE4 instructions:

awk '/[ \t](mpsadbw|phminposuw|pmulld|pmuldq|dpps|dppd|blendps|blendpd|blendvps|blendvpd|pblendvb|pblendw|pminsb|pmaxsb|pminuw|pmaxuw|pminud|pmaxud|pminsd|pmaxsd|roundps|roundss|roundpd|roundsd|insertps|pinsrb|pinsrd|pinsrq|extractps|pextrb|pextrd|pextrw|pextrq|pmovsxbw|pmovzxbw|pmovsxbd|pmovzxbd|pmovsxbq|pmovzxbq|pmovsxwd|pmovzxwd|pmovsxwq|pmovzxwq|pmovsxdq|pmovzxdq|ptest|pcmpeqq|pcmpgtq|packusdw|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|crc32|popcnt|movntdqa|extrq|insertq|movntsd|movntss|lzcnt)[ \t]/' YOURFILE.asm

Find the most common AVX instructions (including scalar ones, AVX2, the AVX-512 family, and some FMA instructions such as vfmadd132pd):

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmovsd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' YOURFILE.asm

NOTE: tested with gawk and nawk.
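
For anyone who would rather avoid very long regular expressions, the same filtering idea can be sketched in C++ with a set lookup instead of one big alternation. This is only an illustration: the function names `extract_mnemonic` and `is_packed_sse_sample` are hypothetical, the set holds a small sample rather than the full lists above, and the parser assumes the usual tab-separated layout of `objdump -d` lines (address, hex bytes, instruction).

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Returns the mnemonic from one line of `objdump -d` output, or "" if the
// line has no instruction field. Assumed layout (tab-separated):
//   "  401000:\t0f 59 c1    \tmulps  %xmm1,%xmm0"
// Caveat: a prefix such as "lock" or "rep" would be returned as the mnemonic.
std::string extract_mnemonic(const std::string& line) {
    const std::size_t npos = std::string::npos;
    const std::size_t first_tab = line.find('\t');
    if (first_tab == npos) return "";
    const std::size_t second_tab = line.find('\t', first_tab + 1);
    if (second_tab == npos) return "";  // label or continuation line
    const std::size_t start = line.find_first_not_of(" \t", second_tab + 1);
    if (start == npos) return "";
    const std::size_t end = line.find_first_of(" \t", start);
    return line.substr(start, end == npos ? npos : end - start);
}

// Checks a mnemonic against a deliberately small sample of packed SSE/SSE2
// instructions; extend the set with the full lists as needed.
bool is_packed_sse_sample(const std::string& mnemonic) {
    static const std::unordered_set<std::string> packed = {
        "addps", "subps", "mulps", "divps",
        "addpd", "subpd", "mulpd", "divpd"};
    return packed.count(mnemonic) != 0;
}
```

Feeding each line of the disassembly through `extract_mnemonic` and then `is_packed_sse_sample` gives an exact-match test on the mnemonic, so operands and symbol names cannot cause false positives.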

Andriy Makukha
  • 2
    You probably don't want to look for scalar SSE and SSE2 instructions; this question is tagged `[simd]` so the OP (and most other people) aren't interested in normal scalar `addss` / `addsd` or `[u]comisd`, only `addps` / `addpd` / `cmppd`. I already pointed this out [on your first version of this answer on another question](https://stackoverflow.com/questions/17109410/how-can-i-check-if-my-installed-numpy-is-compiled-with-sse-sse2-instruction-set/49221140#49221140). (I knew this `gawk` + regex thing looked familiar, so I put a phrase from this answer into google and found the original .:) – Peter Cordes Apr 27 '18 at 09:21
  • @PeterCordes, thanks :) I just found this question when I tried to find my older answer, so decided to post that answer here as well, but slightly updated due to your previous comment. As for SIMD/scalar distinction, your suggestion might be good in some cases, but the question here does not clearly restrict interest to only SIMD instructions. The `[simd]` tag might be used to draw attention from people familiar with SSE and AVX instructions. – Andriy Makukha Apr 27 '18 at 09:37
  • 1
    *it needs to go fast, so I need it to use SSE and AVX instructions.* clearly indicates the OP only means auto-vectorization, and probably doesn't realize that SSE2 instructions are used for scalar FP math. If you want to keep including scalar instructions in your regexes, you should clearly state that in your answer so people know what they're getting. – Peter Cordes Apr 27 '18 at 09:38
  • @PeterCordes, you are right. Improved my answer to give two packed-only options (for SSE and SSE2). – Andriy Makukha Apr 27 '18 at 10:22
  • if you don't care about the asm file then you can use process substitution like this `gawk '/\<(yourIns...)\>/' <(objdump -d YOURFILE)` – phuclv Jul 12 '18 at 16:05
  • The [MOVSD](http://felixcloutier.com/x86/MOVS:MOVSB:MOVSW:MOVSD:MOVSQ.html) and [CMPSD](http://felixcloutier.com/x86/CMPS:CMPSB:CMPSW:CMPSD:CMPSQ.html) in 80386 is for string operations and completely irrelevant to the SSE2 instructions: [CMPSD](http://felixcloutier.com/x86/CMPSD.html), [MOVSD](http://felixcloutier.com/x86/MOVSD.html) – phuclv Aug 02 '18 at 04:16
  • @phuclv, yes, this is the reason why I omit these keywords from search. – Andriy Makukha Aug 02 '18 at 04:32
  • 2
    Just a note that these awk expressions are too long for mawk (the default awk on Ubuntu 18.04): `regular expression /[ \t](addp ... exceeds implementation size limit`. gawk is fine. – bain Aug 26 '19 at 16:38
  • @bain, that is surprising. Isn't gawk the default Linux AWK implementation? Or did Ubuntu community decide to switch to mawk recently? mawk has many limitations... – Andriy Makukha Aug 26 '19 at 17:23
  • 1
    Mawk is a dependency of ubuntu-minimal. Looks like it's been that way since [at least Ubuntu 16.04](https://packages.ubuntu.com/xenial/ubuntu-minimal), possibly earlier. – bain Aug 27 '19 at 14:52
  • Seems to work without issues with mawk version 1.3.4 on MacOS. – Andriy Makukha Feb 20 '20 at 13:46
  • @AndriyMakukha is there also a way to know if the binary was compiled using gcc / Intel or any other compiler? – joepol Dec 13 '20 at 06:17
  • @joepol, in some cases there is a way. For example, [here is one](https://stackoverflow.com/questions/42818737/how-to-tell-if-a-library-is-compiled-with-certain-gcc-version). But generally I'm not aware of an easy way to tell the difference. – Andriy Makukha Dec 13 '20 at 06:46
  • This also works with WSL on Windows binaries. – jakar Apr 13 '21 at 21:20
  • 1
    @PeterCordes have a [fork of objdump](https://github.com/goldsteinn/binutils-gdb-annotate-sse2) here that annotates the SSE2 usage. – Noah Jun 18 '22 at 03:54
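
The packed versus scalar distinction debated in these comments can be made concrete with SSE intrinsics. The sketch below assumes an x86 target; `packed_mul` and `scalar_mul` are hypothetical helper names. `_mm_mul_ps` typically compiles to `mulps` and multiplies all four lanes, while `_mm_mul_ss` typically compiles to `mulss` and multiplies only the lowest lane, copying the other three lanes from its first operand.

```cpp
#include <array>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Packed multiply: all four float lanes are multiplied at once.
std::array<float, 4> packed_mul(const std::array<float, 4>& b,
                                const std::array<float, 4>& c) {
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(),
                  _mm_mul_ps(_mm_loadu_ps(b.data()), _mm_loadu_ps(c.data())));
    return out;
}

// Scalar multiply: only lane 0 is multiplied; lanes 1..3 are copied from b.
std::array<float, 4> scalar_mul(const std::array<float, 4>& b,
                                const std::array<float, 4>& c) {
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(),
                  _mm_mul_ss(_mm_loadu_ps(b.data()), _mm_loadu_ps(c.data())));
    return out;
}
```

Both functions use xmm registers, which is why checking for register names alone cannot separate vectorized code from ordinary scalar floating-point code.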
21

There is no need to check the assembly. Most compilers provide optimisation reports that exactly tell you whether or not your loops were vectorised using SIMD instructions.

If you compile using GCC, set -O3 -march=native to make sure vectorisation is performed using whichever SIMD instruction set (SSE, AVX, ...) the CPU you are compiling on supports, and add -fopt-info to make the compiler verbose about optimisations:

g++ -O3 -march=native -fopt-info -o main.o main.cpp

This will give you output like:

main.cpp:12:20: note: loop vectorized
main.cpp:12:20: note: loop peeled for vectorization to enhance alignment
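
To try the report on something small, a sketch of the question's loop as a standalone function might look like this. The function name `multiply`, the `float` element type, and the use of `__restrict` (a GCC/Clang extension) are assumptions of this example, not details from the question.

```cpp
#include <cstddef>

// Element-wise product, as in the question's loop. `__restrict` promises
// that the three arrays do not overlap (an assumption of this sketch),
// which spares the compiler a runtime overlap check before vectorizing.
void multiply(float* __restrict a, const float* __restrict b,
              const float* __restrict c, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        a[i] = b[i] * c[i];
    }
}
```

Saved as, say, kernel.cpp, this can be compiled with `g++ -O3 -march=native -fopt-info-vec -c kernel.cpp` to restrict the report to vectorization, and `-fopt-info-vec-missed` lists the loops that were not vectorized.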

Hope that helps.

Alexander
noma
2

Since most packed SSE instructions end with PS/PD, there is a simpler way to check for packed SSEx instructions after dumping the disassembly to asmfile:

grep %xmm asmfile | grep -P '([[:xdigit:]]{2}\s)+\s*[[:alnum:]]+p[sd]\s+'

or the xmm check can be combined into the pattern

grep -P '([[:xdigit:]]{2}\s)+\s*[[:alnum:]]+p[sd]\s+.+xmm' asmfile

This suffices for programs that only use floating-point operations. For better coverage, however, you also need to check for instructions that begin with P, so the regex needs a small change:

grep -P '([[:xdigit:]]{2}\s)+\s*([[:alnum:]]+p[sd]\s+|p[[:alnum:]]+).+%xmm' asmfile

To also include MMX instructions in 32-bit code, change the %xmm part at the end to %x?mm.

To check for AVX1/2, you just need to look for ymm or %ymm usage instead of checking the instruction name, because any instruction that operates on a ymm register is a packed (vector) instruction; scalar AVX instructions use xmm registers:

grep ymm asmfile

Similarly AVX-512 can be checked with

grep zmm asmfile
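
The register-based checks above can also be sketched as a small C++ classifier. The names `VectorWidth` and `widest_vector_register` are hypothetical, and the code assumes AT&T-syntax disassembly, where register names carry a `%` prefix; matching on that prefix keeps a symbol name that happens to contain "ymm" from counting as a hit.

```cpp
#include <string>

// Widest vector register class referenced on a disassembly line, in bits.
enum class VectorWidth { None = 0, Xmm = 128, Ymm = 256, Zmm = 512 };

// Looks for AT&T-style register names (with the leading '%'), widest first.
VectorWidth widest_vector_register(const std::string& line) {
    if (line.find("%zmm") != std::string::npos) return VectorWidth::Zmm;
    if (line.find("%ymm") != std::string::npos) return VectorWidth::Ymm;
    if (line.find("%xmm") != std::string::npos) return VectorWidth::Xmm;
    return VectorWidth::None;
}
```

A result of `Xmm` alone does not prove vectorization, since scalar floating-point code uses xmm registers too; `Ymm` or `Zmm` is the stronger signal.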
phuclv
  • 5
    `ymm` is short enough that it could appear inside a symbol name. `objdump -d binary | egrep '%ymm[[:digit:]]+(,|$)'` might be better. Taking advantage of AT&T syntax decoration with `%` is a good way to avoid false positives. The check for being followed by comma or coming at the end of a line is probably overkill because I don't think `%` could appear in a symbol name. – Peter Cordes Jul 22 '18 at 17:04
  • @PeterCordes or you can also strip the binary before passing to grep if it contains debug information – phuclv Jul 22 '18 at 17:08
  • True, but function names with external linkage will still be present, and potentially also global vars as operands. (Plus, you might want to know *which* functions use AVX instructions, in which case you'd want to not strip). – Peter Cordes Jul 22 '18 at 17:12
  • Not all packed SSE instruction names end with PS/PD. For example: [PMAXUB/PMAXUW](https://www.felixcloutier.com/x86/PMAXUB:PMAXUW.html), [PAVGB/PAVGW](https://www.felixcloutier.com/x86/PAVGB:PAVGW.html), [CVTPS2PI](https://www.felixcloutier.com/x86/CVTPS2PI.html). – Andriy Makukha Aug 02 '18 at 04:43
  • @AndriyMakukha yes but checking those is often unnecessary. It's highly probable that a program using SIMD will have other SIMD instructions like add, mul... That's why I didn't even bother to check load/store, conversion and many shuffle instructions, just like how CMPSD can be omitted. I put the check for instructions begin with P in the MMX part though. Anyway I'll update my answer – phuclv Aug 02 '18 at 04:58
  • This answer kind of implies that all AVX-512 instructions involve 512-bit vectors. But it introduced some new instructions that are useful with 128 or 256-bit vectors, e.g. `vpternlogd`, and some compilers default to auto-vectorizing with 256-bit by default on Skylake-avx512. If you want to detect 512-bit vector usage, then yes look for ZMM. Otherwise (detecting AVX512VL with shorter vectors, similarly `vfmadd` with scalar or 128-bit) it's harder, although any usage of `%k[0-9]` is definitely AVX512 if any instructions do that. – Peter Cordes Oct 03 '20 at 07:24
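
The mask-register heuristic from the last comment can be sketched like this. The function name `uses_mask_register` is hypothetical, the input is assumed to be one line of AT&T-syntax disassembly, and the pattern covers the eight AVX-512 mask registers k0 through k7.

```cpp
#include <regex>
#include <string>

// True if the line references an AVX-512 mask register (%k0 .. %k7).
// Works for 128-bit and 256-bit AVX-512VL code as well, where no zmm
// register appears.
bool uses_mask_register(const std::string& line) {
    static const std::regex mask_reg("%k[0-7]\\b");
    return std::regex_search(line, mask_reg);
}
```

This only catches masked instructions, so unmasked AVX-512VL code on xmm or ymm registers would still need a check on the instruction name.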
0

The only way to tell is to disassemble the generated code and see what instructions it's using.

objdump -d <your executable or shared library>
phuclv
  • 5
    This is only the easy part which 5 seconds with google could solve. The hard part is recognizing auto-vectorized code vs. scalar, because they both use the same registers (for scalar FP at least). – Peter Cordes Dec 19 '17 at 03:55
  • @Peter Cordes: Auto-vectorized by the compiler? This still has to produce assembly instructions that can be easily examined. The original question asked how to tell if the generated assembly was using SSE or AVX instructions. Find the function, look at the instructions. One doesn't need to even understand compiler optimizations to examine the generated instructions and see if there are any SSE or AVX instructions in assembly generated for the function in question. – Michael Ngarimu Oct 03 '20 at 07:13
  • That's true, but notice that the question mentions *performance*. The person who asked probably didn't realize that SSE or AVX will be used for scalar FP math. So what they really wanted to know doesn't match their question title. – Peter Cordes Oct 03 '20 at 07:19
-4

As others have pointed out, you may use -S to generate assembly code.

What's more, you could use external tools to disassemble the compiled binary, such as objdump or a more professional one like IDA.