AVX instructions generated when -xSSE4.1 specified

Question

I have compiled a piece of code with the option -xSSE4.1 using the Intel compiler. When I looked at the generated assembly file, I see that AVX instructions such as 'vpmovzxbw' have been inserted. But, the executable still seems to run on machines that don't support the AVX instruction set. What explains this?

Here's the particular code snippet -

C -> src0_8x16b  = _mm_cvtepu8_epi16 (src0_8x16b);

Assembly -> vpmovzxbw xmm4, QWORD PTR [rcx]

Binary -> 00066 c4 62 79 30 29

Here's another snippet where the assembly instruction uses 3 operands -

C -> src0_8x16b = _mm_sub_epi16 (src0_8x16b, src1_8x16b);

Assembly -> vpsubw xmm1, xmm13, xmm11              

Binary -> 000bc c4 c1 11 f9 cb

For comparison, here's the disassembly generated by icc for the function 'foo' (The only difference between the function foo and the code snippet above is that the code snippet was coded using intrinsics) -

Compiler commands used - 
icc -S -xSSE4.1 -axavx -O3 foo.c

Function foo -
void foo(float *x, int n) 
{
    int i;

    for(i=0; i<n; i++) x[i] *= 2.0;
}

Autodispatch code - 
testl     $-131072, __intel_cpu_indicator(%rip)         #1.27
jne       foo.R                                         #1.27
testl     $-1, __intel_cpu_indicator(%rip)              #1.27
jne       foo.A

Loop in foo.R (AVX variant) - 
vmulps    (%rdi,%rcx,4), %ymm0, %ymm1                   #3.24
vmulps    32(%rdi,%rcx,4), %ymm0, %ymm2                 #3.24
vmovups   %ymm1, (%rdi,%rcx,4)                          #3.24
vmovups   %ymm2, 32(%rdi,%rcx,4)                        #3.24
addq      $16, %rcx                                     #3.5
cmpq      %rdx, %rcx                                    #3.5
jb        ..B2.12       # Prob 82%                      #3.5

Loop in foo.A (SSE variant) - 
movaps    (%rdi,%r8,4), %xmm1                           #3.24
movaps    16(%rdi,%r8,4), %xmm2                         #3.24
mulps     %xmm0, %xmm1                                  #3.24
mulps     %xmm0, %xmm2                                  #3.24
movaps    %xmm1, (%rdi,%r8,4)                           #3.24
movaps    %xmm2, 16(%rdi,%r8,4)                         #3.24
addq      $8, %r8                                       #3.5
cmpq      %rsi, %r8                                     #3.5
jb        ..B3.12       # Prob 82%                      #3.5

http://www.felixcloutier.com/x86/PMOVZX.html You are probably confusing it with VPMOVZXBW — , Dec 29 '15 at 10:54
pmovzx is sse41. vpmovzxbw is avx. Check [link](https://software.intel.com/en-us/node/524007) — ashwin, Dec 29 '15 at 11:07
Maybe it generates an AVX version of some things, but only runs it after doing a run-time check that the system supports AVX? Post a snippet of disassembly, including the binary machine code, so we can make sure it's really the VEX-encoding. Ideally, set a breakpoint at that instruction and make sure it actually ever runs, too, if you have a debugger on a pre-AVX machine. — Peter Cordes, Dec 30 '15 at 06:32
Can you please show your code, compiler options, compiler version, and the assembly generated. — Z boson, Dec 30 '15 at 08:25
ICC generates a CPU dispatcher according to Agner Fog. He goes into a lot of detail about this. I don't know how it works. I assumed it would apply to libraries and not to your own code. But in the past I have had [trouble getting ICC to generate the code I told it to](http://stackoverflow.com/questions/17031192/intel-c-compiler-icc-seems-to-ingnore-sse-avx-seetings). — Z boson, Dec 30 '15 at 08:27
Oops, obviously it never runs a VEX instruction on a pre-AVX machine, or else your program would generate an illegal-instruction fault. It's highly unlikely that Intel's compiler generates code that does *that* on purpose and uses an OS-specific method to try again with non-AVX code. So NVM about using the debugger on a non-AVX machine. It might be interesting to check with a breakpoint that it does run on a machine that *does* have AVX. — Peter Cordes, Dec 30 '15 at 09:24
I have indeed checked on an SSE42 machine and it seems to run fine. The interesting part though is that the assembly generated does not change one iota when I change the option -xSSE4.1 to -xAVX. I have enabled the Auto Dispatcher of icc though and it may possibly be doing a runtime switch between AVX and SSE variants, but I have not seen any instructions to that effect in the disassembly. — ashwin, Dec 30 '15 at 09:34
My theory is that use of the xmm registers implies compatibility with SSE4 and above and ymm registers implies AVX and above. — ashwin, Dec 30 '15 at 09:43
@ashwin: your theory is completely wrong. Older CPUs (Intel pre Sandybridge) will fault (illegal instruction) with VEX-encoded 128bit instructions. There is a separate machine encoding, which is why I suggested posting the binary machine code alongside the disassembly output (like from Agner Fog's `objconv`, or from GNU `objdump -d -Mintel`: `31 ff xor edi,edi`). (To rule out disassembler error.) If there are any 3-operand AVX instructions (like `VPXOR xmm0, xmm0, xmm0`), that would be stronger confirmation. VEX insns zero the upper half of the dest ymm, non-VEX don't. — Peter Cordes, Dec 30 '15 at 09:53
The runtime dispatcher will have to run `CPUID` at some point. IDK how ICC works, maybe it still makes variants for older CPUs even with `-xAVX`, if you enable runtime dispatching. I would have expected that runtime dispatch + a target machine level would mean it considered the specified extension baseline, and didn't make any variants for older machines. — Peter Cordes, Dec 30 '15 at 09:56
Your experience of getting the same code independent of your options `-xSSE4.1` or `-xAVX` is what I observed before but I did not explicitly enable the dispatcher. How did you do that? You should add the information to your question. I guess I was implicitly using the dispatcher. BTW, I don't know ICC well but is `-xAVX` the same as `-mavx`? — Z boson, Dec 30 '15 at 11:46
You can tell ICC to dispatch for AVX, SSE4.1 and SSE2 like this `-axAVX -axSSE4.2 -xSSE2`. Can you print out the full compiler options that are used? Maybe `-axAVX` is implicit for some reason? — Z boson, Dec 30 '15 at 11:54
Well what do you expect! Those options tell ICC to generate code for AVX and SSE4.1! — Z boson, Dec 30 '15 at 12:09
@Zboson, in response to one of your previous comments, -xavx is used in icc whereas -mavx is used for gcc. — ashwin, Dec 30 '15 at 13:01
@ashwin ICC recognizes `-mavx` as well. I think it's the same as `-xavx`. — Z boson, Dec 30 '15 at 13:06
@PeterCordes, how much advantage is there to using 128-bit vectors with VEX encoding rather than without it on a machine with AVX (but not AVX2)? — Z boson, Dec 31 '15 at 13:10
@Zboson: depending on the algorithm, you can save a lot of `mov` instructions. This helps if you're running into frontend bottlenecks. On SnB (the only AVX CPU without mov-elimination), it also helps with vector execution port pressure. AVX also lets you fold unaligned loads, which also reduces code size and frontend pressure. You also get `VBROADCASTSS xmm, m32` (but not the reg, reg form; that's AVX2). — Peter Cordes, Dec 31 '15 at 15:10
@PeterCordes, thanks. You wrote "On SnB (the only AVX CPU without mov-elimination)". What do you mean by that? — Z boson, Jan 01 '16 at 14:32
@Zboson: I mean SnB (the specific microarch, not the family) is the only microarch from AMD or Intel that supports AVX, but that needs an ALU execution unit for `movdqa xmm1, xmm0`. AMD Bulldozer-family and Intel IvB and later handle it in the register-rename stage with zero latency. That's what I meant, but it turns out **I'm wrong**: AMD Jaguar has AVX, but the insn tables indicate it runs vector reg-reg moves on the FP0/1 execution units, with non-zero latency. — Peter Cordes, Jan 01 '16 at 16:41
@PeterCordes, switching subjects a bit. Do you think this [answer is correct](http://stackoverflow.com/a/21151780/2542702)? I mean that fusing loads in some cases actually can give worse performance? — Z boson, Jan 01 '16 at 19:05
@Zboson: I've tested the latency in the cache-hit case. See the last bit of http://stackoverflow.com/a/31027695/224132. That might only give an OOO window of only the scheduler size (e.g. 32 uops), while unfused load/use might have an OOO window of the full ROB size (e.g. 192 uops). IDK how one would go about testing that. You'd need a case where the load address was ready far ahead of time, and perf was bottlenecked on having the loads start many cycles before they were needed. — Peter Cordes, Jan 01 '16 at 21:15
@PeterCordes, well you could test the code in my question. I originally found the GCC solution faster than the MSVC solution, which is why I created that answer. The only major difference I found was that GCC did not fold the loads and MSVC did. But I could only read a bit of asm then and I understand a lot more now, so I would not be super surprised if it was something else. I have not looked into it since. — Z boson, Jan 02 '16 at 09:06
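
The machine-code check suggested in the comments above can be sketched in a few lines of C. This is only an illustration, assuming 64-bit mode, where a first opcode byte of 0xC4 or 0xC5 always denotes a VEX prefix; the function name `starts_with_vex` is hypothetical and the helper is not a full instruction decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the instruction bytes begin with a VEX prefix, 0 otherwise.
 * Assumption: 64-bit mode, where 0xC4 (3-byte VEX) and 0xC5 (2-byte VEX)
 * cannot be anything else. Segment-override and address-size prefixes are
 * skipped first, since they may legally precede a VEX prefix. */
int starts_with_vex(const uint8_t *insn, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t b = insn[i];
        if (b == 0x26 || b == 0x2E || b == 0x36 || b == 0x3E ||
            b == 0x64 || b == 0x65 || b == 0x67) {
            i++;            /* prefix that is allowed in front of VEX */
        } else {
            break;
        }
    }
    if (i >= len)
        return 0;
    return insn[i] == 0xC4 || insn[i] == 0xC5;
}
```

Applied to the bytes quoted in the question (`c4 62 79 30 29` and `c4 c1 11 f9 cb`), the check reports a VEX prefix in both cases, whereas the legacy SSE4.1 form of `pmovzxbw` starts with `66 0f 38 30`.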

score 3 · Answer 1 · answered Jan 04 '16 at 07:01

I have tried to replicate the results on two other compilers, viz., gcc and Microsoft Visual Studio's v100 compilers. I was unable to do so, i.e., gcc and v100 compilers seem to be generating the correct disassemblies. As a further step, I looked closely at the differences, if any, that existed between the compiler arguments that I had specified in each case. It turns out that whilst using the icc compiler, I had enabled the option to inherit project defaults for compiling this particular file. The project settings were configured such that this option was included -

-xavx

As a result when this file was being compiled, the settings I had provided -

-xSSE4.1 -axavx

were overridden by the former. This was the cause of the behavior I have detailed in my question.

I am sorry for this error, but I shall not delete this question since @Zboson 's answer is exceptional.

PS - I had mentioned in one of my comments that I was able to run this code on an SSE42 machine. That was because the exe I had run on that machine was indeed SSE41 compliant since I had apparently used an exe generated using the gcc compiler. I ran the icc generated exe and it was indeed crashing with an illegal instruction error on the SSE42 machine.

Thanks for the explanation. That makes sense. — Z boson, Jan 04 '16 at 08:31

Z boson · Answer 2 · 2015-12-30T13:35:42.343

2

The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag.

For example, to generate code that is compatible with AVX, SSE4.1 and SSE2, use -axAVX -axSSE4.2 -xSSE2.

Since you compiled with -axAVX -xSSE4.1, the Intel compiler generated an AVX branch and an SSE4.1 branch, and at runtime it determines which instruction set is available and chooses the matching branch.

Agner Fog has a good description of Intel's CPU dispatcher in his Optimizing C++ manual. See section "13.7 CPU dispatching in Intel compiler". Intel's CPU dispatcher is not ideal for several reasons, one of which is that it behaves badly on AMD processors, which Agner describes in detail. Personally I would make my own dispatcher.
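
A hand-rolled dispatcher along the lines suggested above could be sketched roughly as follows. This is a minimal sketch, assuming GCC or Clang on x86-64: `__builtin_cpu_supports` and the `target` attribute are GCC/Clang extensions rather than ICC's own mechanism, and the names `foo_baseline`, `foo_avx` and `select_foo` are hypothetical.

```c
#include <immintrin.h>

/* Baseline variant: plain C, compiled for the default target
 * (SSE2 is always available on x86-64). */
static void foo_baseline(float *x, int n)
{
    for (int i = 0; i < n; i++) x[i] *= 2.0f;
}

/* AVX variant: the target attribute allows AVX code in this function only,
 * so the rest of the file stays runnable on pre-AVX machines. */
__attribute__((target("avx")))
static void foo_avx(float *x, int n)
{
    const __m256 two = _mm256_set1_ps(2.0f);
    int i = 0;

    for (; i + 8 <= n; i += 8)          /* 8 floats per 256-bit vector */
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), two));
    for (; i < n; i++)                  /* scalar tail */
        x[i] *= 2.0f;
}

typedef void (*foo_fn)(float *, int);

/* Queries the CPU once and picks the widest supported variant. */
static foo_fn select_foo(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? foo_avx : foo_baseline;
}

/* Public entry point: resolves the implementation on first use. */
void foo(float *x, int n)
{
    static foo_fn impl = 0;

    if (!impl)
        impl = select_foo();
    impl(x, n);
}
```

Callers only ever see `foo`; the feature test runs once, and the choice depends on the reported instruction set rather than on the CPU vendor.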

I compiled the following code with ICC 13.0 with options -O3 -axavx -xsse2

void foo(float *x, int n) {
    for(int i=0; i<n; i++) x[i] *= 2.0;
}

and the start of the assembly is

    test      DWORD PTR __intel_cpu_indicator[rip], -131072 #1.27
    jne       _Z3fooPfi.R                                   #1.27
    test      DWORD PTR __intel_cpu_indicator[rip], -1      #1.27
    jne       _Z3fooPfi.A

Following the _Z3fooPfi.R branch, we find the main AVX loop

..B2.12:                        # Preds ..B2.12 ..B2.11
vmulps    ymm1, ymm0, YMMWORD PTR [rdi+rcx*4]           #2.25
vmulps    ymm2, ymm0, YMMWORD PTR [32+rdi+rcx*4]        #2.25
vmovups   YMMWORD PTR [rdi+rcx*4], ymm1                 #2.25
vmovups   YMMWORD PTR [32+rdi+rcx*4], ymm2              #2.25
add       rcx, 16                                       #2.2
cmp       rcx, rdx                                      #2.2
jb        ..B2.12       # Prob 82%                      #2.2

and following the _Z3fooPfi.A branch, we find the main SSE loop

movaps    xmm1, XMMWORD PTR [rdi+r8*4]                  #2.25
movaps    xmm2, XMMWORD PTR [16+rdi+r8*4]               #2.25
mulps     xmm1, xmm0                                    #2.25
mulps     xmm2, xmm0                                    #2.25
movaps    XMMWORD PTR [rdi+r8*4], xmm1                  #2.25
movaps    XMMWORD PTR [16+rdi+r8*4], xmm2               #2.25
add       r8, 8                                         #2.2
cmp       r8, rsi                                       #2.2
jb        ..B3.12       # Prob 82%                      #2.2

edited Dec 30 '15 at 13:35

answered Dec 30 '15 at 12:18

Z boson

But when I use an SSE41 intrinsic, I would expect an SSE41 assembly instruction! The options I have used will only tell the compiler to create two variants of the same function where applicable and switch between them at runtime. – ashwin Dec 30 '15 at 12:20
@ashwin, your code would crash if it only generated AVX instructions on a machine without AVX. It must be generating SSE code but you just have not found it yet. – Z boson Dec 30 '15 at 12:21
I understand that the code should crash. Hence, my question :). If you sift through my previous comments, you should have noticed that I have tested this on an SSE42 machine and it ran fine. Probably as you said I have not found the SSE code yet but I doubt it since I have looked through the disassembly generated on both Windows and Linux (which were on two different versions of the icc compiler BTW) and these were the only instructions I found between the function _begin_ and _end_ – ashwin Dec 30 '15 at 12:36
@ashwin, the reason I said it must have SSE code is because you said in previous comments that you tested on a machine without AVX. That's why I asked you to post your code (not just a single line/instruction). Why don't you write a short simple version which can reproduce your results? It should be easy to make a little `foo` function to do this. – Z boson Dec 30 '15 at 12:40
@ashwin you can also compile with only `-xSSE4.1` and look at the assembly and then compare with `-axAVX -xSSE4.1`. – Z boson Dec 30 '15 at 12:43
I did that and the disassembly was identical in both cases. I am guessing that the disassembly is possibly just indicative and I ought to probably look at the machine code and check whether that is compatible with SSE. From this [link](http://www.felixcloutier.com/x86/PMOVZX.html), it appears that the opcode generated corresponds to PMOVZXBW. – ashwin Dec 30 '15 at 12:54
@ashwin, how are you getting the disassembly? With GCC I use `-S` or instead I disassemble the object file or executable with `objdump -d` (you may want to try `objdump -D`). Why don't you just compile the object file and try `objdump -d foo.o`? The dispatcher will probably not be in the object file but I would guess the different branches would be. – Z boson Dec 30 '15 at 13:04
I used /FAcs in Windows and -S in linux. The dispatcher code is generally present in the same file. – ashwin Dec 30 '15 at 13:12
I am not looking at the executable. The option I mentioned generates a separate .asm file along with the object files and the executables. – ashwin Dec 30 '15 at 13:16
I have used the -S option in linux and the behavior there is identical. I am assuming here that the behavior of both icc -S and objdump are similar. Just to be sure, I tried objdump too and nothing's changed. – ashwin Dec 30 '15 at 13:22
@ashwin, I updated my answer with an example. It produces exactly what I would expect. Two branches with AVX and without and even shows how it calls the dispatcher. I don't know what problem you are having. – Z boson Dec 30 '15 at 13:36
There are no such branches generated in my case. – ashwin Dec 30 '15 at 13:45
Are you using the commercial version or the non-commercial version of the compiler (e.g. the one for students)? – Z boson Dec 30 '15 at 21:28
Can you compile the `foo` function in my answer with your compiler using `-S -O3 -axavx` and add the assembly to your question? – Z boson Dec 30 '15 at 21:44
I am using the commercial version of icc. I shall compile foo and will add the assembly to my question – ashwin Dec 31 '15 at 05:30
@ashwin, thank you. So your compiler does what I expect for a simple function. Now we need to find out why it fails for your function. But you never posted your function like I asked multiple times. I voted to close your question since you don't provide code to reproduce your problem. Please update your question with YOUR code which is not doing what you expect. – Z boson Dec 31 '15 at 08:55
I have provided the relevant code. There are only a few instructions that seem to exhibit this problem. One is **_mm_cvtepu8_epi16** and another one is **_mm_sub_epi16**. Most of the other instructions are translated into their SSE variants. That is the reason I have highlighted the above two instructions in my question. – ashwin Dec 31 '15 at 09:26
@ashwin, sorry, I think auto-vectorization may be a distraction/red herring. You are using intrinsics so the rules may be different. Nevertheless, you need to provide a minimal working example which reproduces your problem. You still have not done that. It's strange that icc produced a VEX-encoded instruction but only using 128-bit vectors. – Z boson Dec 31 '15 at 12:16
@ashwin, I wrote a little function using `_mm_sub_epi16` and compiled it with icc and it only generates `psubw`. I never get a vex encoded version. I can't reproduce your problem. You need to provide code to reproduce the problem. – Z boson Dec 31 '15 at 12:31
icc would use VEX-encoded instructions but only 128-bit vectors because you are using integers and maybe it's not compiling for AVX2. It's like you compiled with `-xavx`. – Z boson Dec 31 '15 at 13:04
I can't provide whole of the code since it is proprietary, hence the snippets. I shall try compiling it using gcc and check whether this behavior is specific to icc. I shall also try and include a more meaningful portion of the code in the question. I realise now that I ought to have done this at the very beginning. – ashwin Dec 31 '15 at 13:11
Intel Compiler Dispatcher is really tricky. It doesn't follow documentation. It is like a black magic. – Royi Apr 16 '19 at 06:17
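
A minimal reproducer of the kind requested in the comments above might look like the following. It is a sketch only: the function name `diff_u8_to_i16`, the loads and the store are hypothetical and not taken from the asker's proprietary code; only the two intrinsics come from the question. The `target` attribute assumes GCC or Clang; with other compilers it may need to be replaced by the corresponding command-line option.

```c
#include <immintrin.h>
#include <stdint.h>

/* Widens eight unsigned bytes from each input to 16-bit lanes and
 * subtracts them, using the two intrinsics named in the question. */
__attribute__((target("sse4.1")))
void diff_u8_to_i16(const uint8_t *src0, const uint8_t *src1, int16_t *dst)
{
    /* Load 8 bytes each and zero-extend them to eight 16-bit values. */
    __m128i a = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)src0));
    __m128i b = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)src1));

    /* Lane-wise 16-bit subtraction, stored unaligned. */
    _mm_storeu_si128((__m128i *)dst, _mm_sub_epi16(a, b));
}
```

Compiling this one function with each set of options and comparing the emitted encodings (`pmovzxbw`/`psubw` versus `vpmovzxbw`/`vpsubw`) isolates the effect of the flags from the rest of the project.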
