
I'll preface this by saying C++ is not my typical area of work; I'm more often in C# and Matlab, and I don't pretend to be able to read x86 assembly. Having seen some videos recently on "modern C++" and the new instructions on the latest processors, though, I figured I'd poke around a bit more and see what I could learn. I do have some existing C++ DLLs that would benefit from speed improvements, as they use many trig and power operations from <cmath>.

So I whipped up a simple benchmark program in VS2013 Express / Desktop. The processor on my machine is an Intel i7-4800MQ (Haswell). The program is pretty simple: it allocates some std::vector<double>s with 5 million random entries each, then loops over them doing a math operation that combines the values. I measure the time spent using std::chrono::high_resolution_clock::now() immediately before and after the loop:

[Edit: Including full program code]

#include "stdafx.h"
#include <chrono>
#include <random>
#include <cmath>
#include <iostream>
#include <string>

int _tmain(int argc, _TCHAR* argv[])
{

    // Set up random number generator
    std::tr1::mt19937 eng;
    std::tr1::normal_distribution<float> dist;

    // Number of calculations to do
    uint32_t n_points = 5000000;

    // Input vectors
    std::vector<double> x1;
    std::vector<double> x2;
    std::vector<double> x3;

    // Output vectors
    std::vector<double> y1;

    // Initialize
    x1.reserve(n_points);
    x2.reserve(n_points);
    x3.reserve(n_points);
    y1.reserve(n_points);

    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1.push_back(dist(eng));
        x2.push_back(dist(eng));
        x3.push_back(dist(eng));
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    for (size_t i = 0; i < n_points; i++)
    {
        double result_value; 

        result_value = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);

        y1.push_back(result_value);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    return 0;
}

I put VS into the Release configuration with standard options (e.g. /O2). I do one build with /arch:IA32 and run it a few times, and another with /arch:AVX and run that a few times. Consistently, the AVX build is ~3.6x slower than the IA32 one; in this specific example, 773 ms compared to 216 ms.

As a sanity check I did try some other very basic operations: combinations of multiplies and adds, raising a number to the 8th power, and so on. Between the two builds, AVX is at least as fast, if not a bit faster. So why might my code above be impacted so much? Or where might I look to find out?

Edit 2: At the suggestion of someone on Reddit, I changed the code into something more vectorizable, which makes both the SSE2 and AVX builds run faster, but AVX is still much slower than SSE2:

#include "stdafx.h"
#include <chrono>
#include <random>
#include <cmath>
#include <iostream>
#include <string>

int _tmain(int argc, _TCHAR* argv[])
{

    // Set up random number generator
    std::tr1::mt19937 eng;
    std::tr1::normal_distribution<double> dist;

    // Number of calculations to do
    uint32_t n_points = 5000000;

    // Input vectors
    std::vector<double> x1;
    std::vector<double> x2;
    std::vector<double> x3;

    // Output vectors
    std::vector<double> y1;

    // Initialize
    x1.reserve(n_points);
    x2.reserve(n_points);
    x3.reserve(n_points);
    y1.reserve(n_points);

    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1.push_back(dist(eng));
        x2.push_back(dist(eng));
        x3.push_back(dist(eng));
        y1.push_back(0.0);
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    for (size_t i = 0; i < n_points; i++)
    {
        y1[i] = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    return 0;
}

  • IA32: 209 ms
  • SSE: 205 ms
  • SSE2: 75 ms
  • AVX: 371 ms

As for the specific version of Visual Studio, this is 2013 Express for Desktop, Update 1 (Version 12.0.30110.00 Update 1).

Tom S
  • Did you try specifying /arch:AVX2 (OK for Haswell) instead of /arch:AVX, and what was the outcome? Also, what if you replace std::vector with C arrays? I don't have MSVS2013 available to check right now. Side note (just my personal thinking): the MS auto-vectorizer has probably been under active development for only 3-4 years, while vectorizers for other C/C++ compilers have been under development for at least 15-20 years. Auto-vectorization is a very complex computer-science and practical problem, so a compiler probably has to mature for some time before it generates binaries with predictable performance. – zam Apr 21 '14 at 10:21
  • @zam AVX2 is not available in Express 2013 (at the moment anyway; perhaps I should check if a new download is available). I did try SSE and SSE2, which were comparable to IA32. I'd have to go back and jot down specific numbers. – Tom S Apr 21 '14 at 12:20
  • The AVX2 switch is claimed in MSDN to be available; I don't know about Express limitations here. If SSE2 really performs better than AVX (by how much?) for exactly the same code, it might indicate roughly three things: a) MS compiler bugs/immaturity, i.e. inefficient code generation or internal compilation-switch troubles (did/can you try the Intel C/C++ Compiler or GCC?); b) memory bandwidth demand (a frequent cause, but you'd need to do more advanced work to triage that); c) a small (<100) number of loop iterations (definitely not your case). – zam Apr 21 '14 at 17:09
  • And I forgot a 4th reason (the most likely one for SSE vs. AVX): some trouble with the MS math library implementation using SSE intrinsics, plus some inefficient dispatches. Please also consider dropping std::, bearing in mind this post: http://stackoverflow.com/questions/6976458/why-is-stdsin-and-stdcos-slower-than-sin-and-cos – zam Apr 21 '14 at 17:39
  • I've played with this extensively; the VS2013 vectorizer does not know how to use the full 512-bit width of the AVX register and treats it like an SSE register. However, you should be very careful about mixing SSE and AVX instructions; doing so causes a major slowdown. Also recognize that trig functions are going to be slow, as they cannot be vectorized. – Mgetz Apr 22 '14 at 02:07
  • @Mgetz AVX's registers are only 256 bits. Also, did you try it with floating-point math? AVX doesn't have integer math. – phuclv Apr 22 '14 at 12:25
  • @LưuVĩnhPhúc My apologies on the register width; AVX2 will be 512 bits, AVX1 is 256. My experiments were focused on double-precision matrix multiplies, which AVX is VERY conducive to. They showed that AVX beat SSE2/x87 handily IF AND ONLY IF the matrix was not sparse; x87 apparently detects a multiply by zero and returns instantly, at least on my Core i7. In x86-64, AVX beat everything hands down regardless of operation or matrix state. – Mgetz Apr 22 '14 at 12:30
  • @Mgetz AVX2 is still 256 bits; only AVX-512, which won't be available until 2015 with the Knights Landing architecture, is 512 bits. – phuclv Apr 22 '14 at 12:34
  • @LưuVĩnhPhúc There seems to be some conflicting information out there; I have seen some indications that Skylake might ship with it, but it wasn't fully confirmed. – Mgetz Apr 22 '14 at 12:37
  • @TomS try to use simple, easy-to-vectorize code and benchmark that. You should also check [MSVC's auto-vectorizer report](http://msdn.microsoft.com/en-us/library/hh872235.aspx) and fix the code so that the compiler emits vectorized output. VS can vectorize sin/cos, but I don't know whether it can vectorize arctan: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/05/22/auto-vectorizer-in-visual-studio-11-did-it-work.aspx – phuclv Apr 22 '14 at 12:47
  • This might be useful http://stackoverflow.com/questions/21960229/unexpectedly-good-performance-with-openmp-parallel-for-loop/21965635#21965635 – Z boson Apr 23 '14 at 07:23
  • 2
    WARNING: Don't use ``std::chrono::high_resolution_clock`` for profiling with VS 2013 or VS 2012. Use ``QueryPerformanceCounter`` to get the expected high resolution. See [VS Connect](https://connect.microsoft.com/VisualStudio/feedback/details/719443/). Note this is fixed for [Visual Studio "14"](http://blogs.msdn.com/b/vcblog/archive/2014/06/06/c-14-stl-features-fixes-and-breaking-changes-in-visual-studio-14-ctp1.aspx). – Chuck Walbourn Oct 02 '14 at 07:18
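Following up on the last comment, here is a minimal sketch of timing with ``QueryPerformanceCounter`` instead; ``QpcTimer`` and ``elapsed_ms`` are illustrative names, not an existing API:

#include <windows.h>

// Simple RAII-style stopwatch around the Win32 high-resolution counter.
struct QpcTimer
{
    LARGE_INTEGER freq, start;

    QpcTimer()
    {
        QueryPerformanceFrequency(&freq);   // ticks per second
        QueryPerformanceCounter(&start);    // tick count at construction
    }

    double elapsed_ms() const
    {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        return 1000.0 * double(now.QuadPart - start.QuadPart) / double(freq.QuadPart);
    }
};

Constructing a QpcTimer just before the math loop and reading elapsed_ms() just after it stands in for the two high_resolution_clock::now() calls in the code above.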

2 Answers


When the CPU switches between executing AVX and legacy SSE instructions, it needs to save/restore the upper halves of the ymm registers, which can incur a pretty large penalty.

Normally, compiling with /arch:AVX fixes this for your own code, as the compiler will use VEX-encoded 128-bit (AVX-128) instructions instead of legacy SSE ones where possible. In this case, however, it may be that your standard library's math functions are not implemented using AVX instructions, in which case you'd pay a transition penalty on every function call. You'd have to post a disassembly to be sure.

You often see VZEROUPPER emitted before a transition to signal that the CPU doesn't need to save the upper halves of the registers, but the compiler is not smart enough to know whether a function it calls requires that too.
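As an illustration of the manual workaround (a sketch, not code from this answer): finish all 256-bit work and store the results, then execute the vzeroupper intrinsic before calling SSE-encoded code. legacy_sse_math below is a hypothetical stand-in for such a routine; in reality it would live in a separately compiled, non-AVX translation unit.

#include <immintrin.h>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a math routine built with legacy (non-VEX)
// SSE encodings, like the CRT's sin/atan in the scenario above.
double legacy_sse_math(double x) { return std::sin(x); }

double compute(const double* src, size_t n)
{
    __m256d acc = _mm256_setzero_pd();      // 256-bit work dirties ymm upper halves
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(src + i));

    double partial[4];
    _mm256_storeu_pd(partial, acc);         // finish all 256-bit work first

    // Zero the upper 128 bits of every ymm register before calling
    // SSE-encoded code, avoiding the AVX<->SSE state-transition penalty.
    _mm256_zeroupper();

    double sum = partial[0] + partial[1] + partial[2] + partial[3];
    for (; i < n; ++i)
        sum += src[i];                      // scalar remainder
    return legacy_sse_math(sum);
}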

Cory Nelson

So, based on @Lưu Vĩnh Phúc's comments, I investigated a bit. You can get this to vectorize very nicely, but not using std::vector or std::valarray. I also had to pull raw pointers out of the std::unique_ptrs; otherwise that too blocked vectorization.

#include <chrono>
#include <random>
#include <math.h>
#include <iostream>
#include <string>
#include <valarray>
#include <functional>
#include <memory>

#pragma intrinsic(sin, atan)
int wmain(int argc, wchar_t* argv[])
{

    // Set up random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    std::normal_distribution<double> dist;

    // Number of calculations to do
    const uint32_t n_points = 5000000;

    // Input vectors
    std::unique_ptr<double[]> x1 = std::make_unique<double[]>(n_points);
    std::unique_ptr<double[]> x2 = std::make_unique<double[]>(n_points);
    std::unique_ptr<double[]> x3 = std::make_unique<double[]>(n_points);

    // Output vectors
    std::unique_ptr<double[]> y1 = std::make_unique<double[]>(n_points);
    auto random = std::bind(dist, eng);
    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1[i] = random();
        x2[i] = random();
        x3[i] = random();
        y1[i] = 0.0;
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    double * x_1 = x1.get(), *x_2 = x2.get(), *x_3 = x3.get(), *y_1 = y1.get();
    for (size_t i = 0; i < n_points; ++i)
    {
        y_1[i] = sin(x_1[i]) * x_2[i] * atan(x_3[i]);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";
    std::cin.ignore();
    return 0;
}

On my machine, compiled with /arch:AVX this took 103 ms; with /arch:IA32, 252 ms; and with nothing set (the /arch:SSE2 default), 98 ms.

Looking at the generated assembly, it seems the vector math functions are implemented using SSE instructions, so using AVX instructions around them causes transition penalties and slows things down. Hopefully MS will implement AVX versions in the future.

The relevant asm, lacking vzeroupper:

$LL3@wmain:
    vmovupd xmm0, XMMWORD PTR [esi]
    call    ___vdecl_sin2
    mov eax, DWORD PTR tv1250[esp+10212]
    vmulpd  xmm0, xmm0, XMMWORD PTR [eax+esi]
    mov eax, DWORD PTR tv1249[esp+10212]
    vmovaps XMMWORD PTR tv1240[esp+10212], xmm0
    vmovupd xmm0, XMMWORD PTR [eax+esi]
    call    ___vdecl_atan2
    dec DWORD PTR tv1260[esp+10212]
    lea esi, DWORD PTR [esi+16]
    vmulpd  xmm0, xmm0, XMMWORD PTR tv1240[esp+10212]
    vmovupd XMMWORD PTR [edi+esi-16], xmm0
    jne SHORT $LL3@wmain

Versus the SSE2 asm; note the same vector sin and atan calls:

$LL3@wmain:
    movupd  xmm0, XMMWORD PTR [esi]
    call    ___vdecl_sin2
    mov eax, DWORD PTR tv1250[esp+10164]
    movupd  xmm1, XMMWORD PTR [eax+esi]
    mov eax, DWORD PTR tv1249[esp+10164]
    mulpd   xmm0, xmm1
    movaps  XMMWORD PTR tv1241[esp+10164], xmm0
    movupd  xmm0, XMMWORD PTR [eax+esi]
    call    ___vdecl_atan2
    dec DWORD PTR tv1260[esp+10164]
    lea esi, DWORD PTR [esi+16]
    movaps  xmm1, XMMWORD PTR tv1241[esp+10164]
    mulpd   xmm1, xmm0
    movupd  XMMWORD PTR [edi+esi-16], xmm1
    jne SHORT $LL3@wmain

Other things of note:

  • VS is only using the bottom 128 bits of the AVX registers, despite their being 256 bits wide (see the sketch below)
  • There are no overloads of the vector math functions for AVX
  • AVX2 isn't supported yet
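On the first point, hand-written intrinsics can at least keep the multiplies at the full 256-bit width. A sketch, not code from this answer (kernel is an illustrative name, and the trig calls stay scalar for lack of AVX vector versions):

#include <immintrin.h>
#include <cmath>
#include <cstddef>

// Multiplies run four doubles at a time in ymm registers; sin/atan
// remain scalar calls since no AVX vector overloads are provided.
void kernel(const double* x1, const double* x2, const double* x3,
            double* y1, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        double s[4], a[4];
        for (int j = 0; j < 4; ++j)
        {
            s[j] = std::sin(x1[i + j]);     // scalar trig
            a[j] = std::atan(x3[i + j]);
        }
        __m256d prod = _mm256_mul_pd(_mm256_loadu_pd(s),
                       _mm256_mul_pd(_mm256_loadu_pd(x2 + i),
                                     _mm256_loadu_pd(a)));
        _mm256_storeu_pd(y1 + i, prod);     // full 256-bit store
    }
    for (; i < n; ++i)                      // scalar remainder
        y1[i] = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
}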
Mgetz