
I am benchmarking a set of applications on a Sandy Bridge processor (i7-3820). The benchmark comes in two versions that contain the same code, the only difference being that the first version uses SSE/SSE2 intrinsics and the second uses AVX intrinsics.

I am compiling the benchmark with Visual Studio 2015.

When I compile the SSE version for either x64 or x86, the execution time is almost the same. But when I compile the AVX version for x64, the execution time is much worse (almost double) than for the same AVX version compiled for x86. Furthermore, the AVX version compiled for x86 achieves only a small speedup (about 8%) over the SSE version.

Finally, I tested the above configurations on an Ivy Bridge processor (i7-3770), and the execution times of the AVX version were the same for x64 and x86. But the AVX intrinsics didn't show any improvement over SSE.

Is there an explanation for the bad performance of the AVX intrinsics on Sandy Bridge when compiling for x64?

Why don't the two architectures show any speedup for the AVX instructions over the SSE instructions?

Moreover, I tried switching the compilation between /arch:AVX and /arch:SSE2, but the execution times did not change. If I am right, though, the 'Enable Enhanced Instruction Set' property in Visual Studio affects only the vectorization.

Thanks in advance.

  • Could you post the code too? – harold Jun 27 '16 at 17:11
  • It really depends on which AVX instructions you're using - some offer no benefit over their SSE equivalents. However, without seeing your code we can only guess, which isn't very constructive. Please post the relevant code, preferably as a [mcve]. – Paul R Jun 27 '16 at 21:19
  • `/arch:AVX` affects more than vectorization: it allows VEX-encoded instructions, i.e. if you try to use AVX intrinsics without `/arch:AVX` it won't work. – Z boson Jun 28 '16 at 06:13
  • `s/instrinsics/intrinsics/g` – Z boson Jun 28 '16 at 06:16

1 Answer


> compiling the benchmark with AVX intrinsics for x64, the execution time is worse

This is almost certainly caused by AVX<->SSE transition penalties: legacy-SSE-encoded instructions are being mixed with 256-bit AVX instructions without a `vzeroupper` in between.

See also Using AVX CPU instructions: Poor performance without "/arch:AVX"

x64 binaries probably use legacy SSE2 instructions for scalar FP math. If you compile all your code with AVX enabled, those instructions should use the VEX encoding. But you still need vzeroupper around calls to library functions.

Your x86 32bit binary probably doesn't use any legacy SSE2 instructions between AVX functions, maybe not even in library function calls.
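The `vzeroupper` advice above can be sketched with the `_mm256_zeroupper()` intrinsic. This is only an illustration under stated assumptions, not the OP's code: `sum_floats` is a hypothetical helper, and the AVX path is compiled only when the compiler defines `__AVX__` (as MSVC does with `/arch:AVX`, and GCC/Clang do with `-mavx`); otherwise it falls back to a plain scalar loop.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the question): sums n floats.
 * Uses 256-bit AVX when the build enables it, scalar code otherwise. */
float sum_floats(const float *data, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    /* Done with 256-bit registers: clear the upper halves of the ymm
     * registers so that legacy-SSE code running afterwards (for example
     * in a library built without AVX) avoids the transition penalty. */
    _mm256_zeroupper();

    for (int k = 0; k < 8; ++k)
        total += lanes[k];
#endif
    /* Scalar tail; this is the whole loop when AVX is not enabled. */
    for (; i < n; ++i)
        total += data[i];
    return total;
}
```

Some compilers emit `vzeroupper` on their own at the end of functions that use 256-bit registers when AVX code generation is enabled, so the explicit call may be redundant there, but it should be harmless. Looking at the generated assembly for both the x64 and x86 builds is the reliable way to see whether legacy SSE instructions end up between the AVX ones.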


edit: i7-3820 is 32nm SandyBridge-E, not IvyBridge, my mistake. See Agner Fog's microarch pdf and the tag wiki if you're curious about the difference between SnB and IvB.

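I think you're saying that AVX was less of a speedup over SSE on your IvB. One of the major new features in IvB is mov-elimination. It handles `movdqa xmm,xmm` register->register moves in the rename stage with zero latency, without needing an execution unit. That mostly helps legacy SSE code, which needs extra register copies because its two-operand instructions overwrite one of their sources, while VEX-encoded AVX instructions are non-destructive three-operand forms. So SSE code may close some of the gap on IvB.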

If you're only seeing AVX transition delays on one computer, maybe you're compiling with different libraries or compiler versions.

If you want more of an answer than this, put some actual numbers in a table or bullet list where we can see them all easily.

Peter Cordes
  • Why does the OP observe transition delays with Sandy Bridge but not with Ivy Bridge? The OP wrote "I tested the above configurations on an Ivy Bridge processor (i7-3770) and the execution times between x64 and x86 for avx intrinsics was same. But the avx intrinsics didn't show any improvement against the sse." So I infer that to mean that the OP does not see worse x86-64 performance with AVX on Ivy Bridge. – Z boson Jun 29 '16 at 06:11
  • @Zboson: oh, I didn't even notice he called one of his i7-3xxx CPUs an IvB. I looked at the numbers and saw they were both IvB, so I assumed they were the same microarch and missed what he was saying about perf differences. Some actual numbers in an easy-to-look-at format (like a bullet list or table) would go a long way to making this a better question. – Peter Cordes Jun 29 '16 at 12:56
  • Anyway, if the OP used different compilers or libraries on the different computers, maybe one of them did vzeroupper automatically? – Peter Cordes Jun 29 '16 at 12:56
  • @Zboson: oops, I just realized that i7-3820 is Sandybridge-E. Intel has this annoying numbering scheme where the -E parts get the same leading digit as the regular next-gen parts. – Peter Cordes Jun 29 '16 at 13:22
  • Yeah, the OP should include some code and numbers; either of those would be helpful. – Z boson Jun 29 '16 at 13:24