Why is this AVX code slower?

Question

Updated: 19 Aug. 2017, 16:49 UTC

I’m writing an AVX code to multiply a vector with 4 billion components by a constant, however, I see no difference between my small -- I hope -- optimized AVX code and the long scalar compiler optimized version.

Both versions run between 410 ms - 400 ms.

Can someone tell me why it is occurring? And why the large assembly generated by the compiler code takes almost the same time even it's larger ?

It's an important question, because if small computations -- like this multiplication -- have no improvement then it has no sense to use made the manual code in an Intel Core CPU. Perhaps in an Intel Xeon ( with 16 components ) or for more complex computations.

I'm compiling with G++ with parameters: g++ -O3 -mtune=native -march=native -mavx -g3 -Wall -c -fmessage-length=0 -MMD -MP -MF"src/Test AVX.d" -MT"src/Test\ AVX.d" -o "src/Test AVX.o" "../src/Test AVX.cpp"

My CPU is a Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz.

There is the AVX code:

/**
 * Run AVX Code
 */
void AVX() {

    // Loop control
    uint_fast32_t loop = 0;

    // The constant
    __m256 _const = _mm256_set1_ps(5.0f);

    // The register for multiplication
    __m256 _ymm0 = _mm256_setzero_ps();

    // A "buffer" between the vector and the YMM0 register
    float f_data[8];


    // The main loop
    for ( loop = 0  ; loop < SIZE ; loop = loop + 8 ) {

        // Load to buffer
        f_data[0] = vector[loop];
        f_data[1] = vector[loop+1];
        f_data[2] = vector[loop+2];
        f_data[3] = vector[loop+3];
        f_data[4] = vector[loop+4];
        f_data[5] = vector[loop+5];
        f_data[6] = vector[loop+6];
        f_data[7] = vector[loop+7];

        /*
         * I tried to use pointers insted to copy
         * the data, but the software crash
         *
         * float **f_data;
         * f_data = float*[8];
         *
         * f_data[0] = &vector[loop];
         * ...
         *
         */


        // Load to XMM and YMM Registers
        _ymm0 = _mm256_load_ps(f_data);

        // Do the multiplication
        _ymm0 =  _mm256_mul_ps(_ymm0,_const);

        // Copy the results from the register to the "buffer"
        _mm256_store_ps(f_data,_ymm0);

        // Copy from the "buffer" to the vector
        vector[loop] = f_data[0];
        vector[loop+1] = f_data[1];
        vector[loop+2] = f_data[2];
        vector[loop+3] = f_data[3];
        vector[loop+4] = f_data[4];
        vector[loop+5] = f_data[5];
        vector[loop+6] = f_data[6];
        vector[loop+7] = f_data[7];


    }

}

The AVX assembled:

0000000000400de0 <_Z3AVXv>:
  400de0:   48 8b 05 b1 13 20 00    mov    rax,QWORD PTR [rip+0x2013b1]        # 602198 <vector>
  400de7:   c5 fc 28 0d 71 06 00    vmovaps ymm1,YMMWORD PTR [rip+0x671]        # 401460 <_IO_stdin_used+0x40>
  400dee:   00 
  400def:   48 8d 90 00 00 00 40    lea    rdx,[rax+0x40000000]
  400df6:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  400dfd:   00 00 00 
  400e00:   c5 f4 59 00             vmulps ymm0,ymm1,YMMWORD PTR [rax]
  400e04:   48 83 c0 20             add    rax,0x20
  400e08:   c5 fc 11 40 e0          vmovups YMMWORD PTR [rax-0x20],ymm0
  400e0d:   48 39 c2                cmp    rdx,rax
  400e10:   75 ee                   jne    400e00 <_Z3AVXv+0x20>
  400e12:   c5 f8 77                vzeroupper 
  400e15:   c3                      ret    
  400e16:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  400e1d:   00 00 00

The Serial Version:

/**
 * Run Compiler optimized version
 */
void Serial() {

    uint_fast32_t loop;

    // Do the multiplication
    for ( loop = 0 ; loop < SIZE ; loop ++)
        vector[loop] *= 5;

}

The serial assembled:

It's more large, move the data more times and take almost the same time. How it's possible ?

0000000000400e80 <_Z6Serialv>:
  400e80:   48 8b 35 11 13 20 00    mov    rsi,QWORD PTR [rip+0x201311]        # 602198 <vector>
  400e87:   48 89 f0                mov    rax,rsi
  400e8a:   48 c1 e8 02             shr    rax,0x2
  400e8e:   48 f7 d8                neg    rax
  400e91:   83 e0 07                and    eax,0x7
  400e94:   0f 84 96 01 00 00       je     401030 <_Z6Serialv+0x1b0>
  400e9a:   c5 fa 10 05 7a 04 00    vmovss xmm0,DWORD PTR [rip+0x47a]        # 40131c <_IO_stdin_used+0x1c>
  400ea1:   00 
  400ea2:   c5 fa 59 0e             vmulss xmm1,xmm0,DWORD PTR [rsi]
  400ea6:   c5 fa 11 0e             vmovss DWORD PTR [rsi],xmm1
  400eaa:   48 83 f8 01             cmp    rax,0x1
  400eae:   0f 84 8c 01 00 00       je     401040 <_Z6Serialv+0x1c0>
  400eb4:   c5 fa 59 4e 04          vmulss xmm1,xmm0,DWORD PTR [rsi+0x4]
  400eb9:   c5 fa 11 4e 04          vmovss DWORD PTR [rsi+0x4],xmm1
  400ebe:   48 83 f8 02             cmp    rax,0x2
  400ec2:   0f 84 89 01 00 00       je     401051 <_Z6Serialv+0x1d1>
  400ec8:   c5 fa 59 4e 08          vmulss xmm1,xmm0,DWORD PTR [rsi+0x8]
  400ecd:   c5 fa 11 4e 08          vmovss DWORD PTR [rsi+0x8],xmm1
  400ed2:   48 83 f8 03             cmp    rax,0x3
  400ed6:   0f 84 86 01 00 00       je     401062 <_Z6Serialv+0x1e2>
  400edc:   c5 fa 59 4e 0c          vmulss xmm1,xmm0,DWORD PTR [rsi+0xc]
  400ee1:   c5 fa 11 4e 0c          vmovss DWORD PTR [rsi+0xc],xmm1
  400ee6:   48 83 f8 04             cmp    rax,0x4
  400eea:   0f 84 2d 01 00 00       je     40101d <_Z6Serialv+0x19d>
  400ef0:   c5 fa 59 4e 10          vmulss xmm1,xmm0,DWORD PTR [rsi+0x10]
  400ef5:   c5 fa 11 4e 10          vmovss DWORD PTR [rsi+0x10],xmm1
  400efa:   48 83 f8 05             cmp    rax,0x5
  400efe:   0f 84 6f 01 00 00       je     401073 <_Z6Serialv+0x1f3>
  400f04:   c5 fa 59 4e 14          vmulss xmm1,xmm0,DWORD PTR [rsi+0x14]
  400f09:   c5 fa 11 4e 14          vmovss DWORD PTR [rsi+0x14],xmm1
  400f0e:   48 83 f8 06             cmp    rax,0x6
  400f12:   0f 84 6c 01 00 00       je     401084 <_Z6Serialv+0x204>
  400f18:   c5 fa 59 46 18          vmulss xmm0,xmm0,DWORD PTR [rsi+0x18]
  400f1d:   41 b9 f9 ff ff 0f       mov    r9d,0xffffff9
  400f23:   41 ba 07 00 00 00       mov    r10d,0x7
  400f29:   c5 fa 11 46 18          vmovss DWORD PTR [rsi+0x18],xmm0
  400f2e:   41 b8 00 00 00 10       mov    r8d,0x10000000
  400f34:   c5 fc 28 0d 04 04 00    vmovaps ymm1,YMMWORD PTR [rip+0x404]        # 401340 <_IO_stdin_used+0x40>
  400f3b:   00 
  400f3c:   48 8d 0c 86             lea    rcx,[rsi+rax*4]
  400f40:   31 d2                   xor    edx,edx
  400f42:   49 29 c0                sub    r8,rax
  400f45:   31 c0                   xor    eax,eax
  400f47:   4c 89 c7                mov    rdi,r8
  400f4a:   48 c1 ef 03             shr    rdi,0x3
  400f4e:   66 90                   xchg   ax,ax
  400f50:   c5 f4 59 04 01          vmulps ymm0,ymm1,YMMWORD PTR [rcx+rax*1]
  400f55:   48 83 c2 01             add    rdx,0x1
  400f59:   c5 fc 29 04 01          vmovaps YMMWORD PTR [rcx+rax*1],ymm0
  400f5e:   48 83 c0 20             add    rax,0x20
  400f62:   48 39 d7                cmp    rdi,rdx
  400f65:   77 e9                   ja     400f50 <_Z6Serialv+0xd0>
  400f67:   4c 89 c1                mov    rcx,r8
  400f6a:   4c 89 ca                mov    rdx,r9
  400f6d:   48 83 e1 f8             and    rcx,0xfffffffffffffff8
  400f71:   49 8d 04 0a             lea    rax,[r10+rcx*1]
  400f75:   48 29 ca                sub    rdx,rcx
  400f78:   49 39 c8                cmp    r8,rcx
  400f7b:   0f 84 98 00 00 00       je     401019 <_Z6Serialv+0x199>
  400f81:   48 8d 0c 86             lea    rcx,[rsi+rax*4]
  400f85:   c5 fa 10 05 8f 03 00    vmovss xmm0,DWORD PTR [rip+0x38f]        # 40131c <_IO_stdin_used+0x1c>
  400f8c:   00 
  400f8d:   c5 fa 59 09             vmulss xmm1,xmm0,DWORD PTR [rcx]
  400f91:   c5 fa 11 09             vmovss DWORD PTR [rcx],xmm1
  400f95:   48 8d 48 01             lea    rcx,[rax+0x1]
  400f99:   48 83 fa 01             cmp    rdx,0x1
  400f9d:   74 7a                   je     401019 <_Z6Serialv+0x199>
  400f9f:   48 8d 0c 8e             lea    rcx,[rsi+rcx*4]
  400fa3:   c5 fa 59 09             vmulss xmm1,xmm0,DWORD PTR [rcx]
  400fa7:   c5 fa 11 09             vmovss DWORD PTR [rcx],xmm1
  400fab:   48 8d 48 02             lea    rcx,[rax+0x2]
  400faf:   48 83 fa 02             cmp    rdx,0x2
  400fb3:   74 64                   je     401019 <_Z6Serialv+0x199>
  400fb5:   48 8d 0c 8e             lea    rcx,[rsi+rcx*4]
  400fb9:   c5 fa 59 09             vmulss xmm1,xmm0,DWORD PTR [rcx]
  400fbd:   c5 fa 11 09             vmovss DWORD PTR [rcx],xmm1
  400fc1:   48 8d 48 03             lea    rcx,[rax+0x3]
  400fc5:   48 83 fa 03             cmp    rdx,0x3
  400fc9:   74 4e                   je     401019 <_Z6Serialv+0x199>
  400fcb:   48 8d 0c 8e             lea    rcx,[rsi+rcx*4]
  400fcf:   c5 fa 59 09             vmulss xmm1,xmm0,DWORD PTR [rcx]
  400fd3:   c5 fa 11 09             vmovss DWORD PTR [rcx],xmm1
  400fd7:   48 8d 48 04             lea    rcx,[rax+0x4]
  400fdb:   48 83 fa 04             cmp    rdx,0x4
  400fdf:   74 38                   je     401019 <_Z6Serialv+0x199>
  400fe1:   48 8d 0c 8e             lea    rcx,[rsi+rcx*4]
  400fe5:   c5 fa 59 09             vmulss xmm1,xmm0,DWORD PTR [rcx]
  400fe9:   c5 fa 11 09             vmovss DWORD PTR [rcx],xmm1
  400fed:   48 8d 48 05             lea    rcx,[rax+0x5]
  400ff1:   48 83 fa 05             cmp    rdx,0x5
  400ff5:   74 22                   je     401019 <_Z6Serialv+0x199>
  400ff7:   48 8d 0c 8e             lea    rcx,[rsi+rcx*4]
  400ffb:   48 83 c0 06             add    rax,0x6
  400fff:   c5 fa 59 09             vmulss xmm1,xmm0,DWORD PTR [rcx]
  401003:   c5 fa 11 09             vmovss DWORD PTR [rcx],xmm1
  401007:   48 83 fa 06             cmp    rdx,0x6
  40100b:   74 0c                   je     401019 <_Z6Serialv+0x199>
  40100d:   48 8d 04 86             lea    rax,[rsi+rax*4]
  401011:   c5 fa 59 00             vmulss xmm0,xmm0,DWORD PTR [rax]
  401015:   c5 fa 11 00             vmovss DWORD PTR [rax],xmm0
  401019:   c5 f8 77                vzeroupper 
  40101c:   c3                      ret    
  40101d:   41 ba 04 00 00 00       mov    r10d,0x4
  401023:   41 b9 fc ff ff 0f       mov    r9d,0xffffffc
  401029:   e9 00 ff ff ff          jmp    400f2e <_Z6Serialv+0xae>
  40102e:   66 90                   xchg   ax,ax
  401030:   41 b9 00 00 00 10       mov    r9d,0x10000000
  401036:   45 31 d2                xor    r10d,r10d
  401039:   e9 f0 fe ff ff          jmp    400f2e <_Z6Serialv+0xae>
  40103e:   66 90                   xchg   ax,ax
  401040:   41 b9 ff ff ff 0f       mov    r9d,0xfffffff
  401046:   41 ba 01 00 00 00       mov    r10d,0x1
  40104c:   e9 dd fe ff ff          jmp    400f2e <_Z6Serialv+0xae>
  401051:   41 ba 02 00 00 00       mov    r10d,0x2
  401057:   41 b9 fe ff ff 0f       mov    r9d,0xffffffe
  40105d:   e9 cc fe ff ff          jmp    400f2e <_Z6Serialv+0xae>
  401062:   41 ba 03 00 00 00       mov    r10d,0x3
  401068:   41 b9 fd ff ff 0f       mov    r9d,0xffffffd
  40106e:   e9 bb fe ff ff          jmp    400f2e <_Z6Serialv+0xae>
  401073:   41 ba 05 00 00 00       mov    r10d,0x5
  401079:   41 b9 fb ff ff 0f       mov    r9d,0xffffffb
  40107f:   e9 aa fe ff ff          jmp    400f2e <_Z6Serialv+0xae>
  401084:   41 ba 06 00 00 00       mov    r10d,0x6
  40108a:   41 b9 fa ff ff 0f       mov    r9d,0xffffffa
  401090:   e9 99 fe ff ff          jmp    400f2e <_Z6Serialv+0xae>
  401095:   90                      nop
  401096:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  40109d:   00 00 00

The full code:

#include <iostream>
#include <xmmintrin.h>
#include <immintrin.h>


using namespace std;

/**
 * The vector size
 * 268435456 -> 32*8388608 -> 2^32
 */
#define SIZE 268435456

/**
 * The vector for computations
 */
float *vector;

/**
 * Run AVX Code
 */
void AVX() { ... }


/**
 * Run Compiler optimized version
 */
void Serial() { ... }


/**
 * Create the vector
 */
void create() {
    vector = new float[SIZE];
}

/**
 * Fill the vector with data
 * to be used for validation
 */
void fill() {

    uint_fast32_t loop = 0;

    // Fill the vector
    for ( loop = 0  ; loop < SIZE ; loop++ )
        vector[loop] = 1;

}


/**
 * A validation to ensure the compiler have
 * computed all the vector data
 */
void validation() {

    // The loop variable
    unsigned long loop = 0;
    unsigned long errors = 0;
    unsigned long checks = 0;

    for ( loop = 0 ; loop < SIZE ; loop ++  ) {

        // All the vector must be 5
        if ( vector[loop] != 5 ) {
            errors ++;

            // To avoid to show too many errors
            if ( errors < 12 )
                std::cout << loop << ": " << vector[loop] << std::endl;

        }

        checks ++;
    }

    // The result
    std::cout << "Errors: " << errors << "\nChecks: " << checks << std::endl;


}


int main() {

    // Create the vector
    create();
    // Fill with data
    //fill();

    // The tests

    //Serial();
    AVX();

    /*
     * To ensure that the g++ optimization have executed the loop
     */
    //validation();

}

Compiled with: g++ -O3 -mtune=native -march=native -mavx -g3 -Wall -c -fmessage-length=0 -MMD -MP -MF"src/Test AVX.d" -MT"src/Test\ AVX.d" -o "src/Test AVX.o" "../src/Test AVX.cpp"

You should show your entire code that includes the loop that goes over all of your elements. Also, you're invalidly accessing memory in the AVX version. f_data only has 4 elements (128 bits) and you are loading/storing 8 elements at a time. — Jason R, Aug 19 '17 at 14:54
My bad @JasonR, I know the need for 32 bytes alignment, I fixed it, but the performance problem persists. — Amanda Osvaldo, Aug 19 '17 at 15:28
Something is *very* wrong here. Nearly all of the assembly instructions are accessing memory. That shouldn't be happening; all of this data should be enregistered. Are you compiling with optimizations disabled? — Cody Gray - on strike, Aug 19 '17 at 15:30
What does it look like with optimization enabled? The generated code you showed is very inefficient as you would expect with no optimization. — Jason R, Aug 19 '17 at 15:31
You're not compiling with optimization enabled. Your example code does not multiply a million components by a constant, so it is not a simplified version of your original code. It does something different, and when compiled with optimizations enabled is completely compiled away as it calculates a result and then discards the result. — Ross Ridge, Aug 19 '17 at 15:48
@CodyGray, I turned the optimization off because the compiler "jumped" the loop. But take a look in the new version. I had rewrite again in an optimized fashion. — Amanda Osvaldo, Aug 19 '17 at 17:10
@JasonR, The optimization is disabled, but take a look in the new code, I had rewrite again in an optimized fashion. — Amanda Osvaldo, Aug 19 '17 at 17:11
@RossRidge, thanks very much, I had rewrite again in an optimized fashion, please take a look. — Amanda Osvaldo, Aug 19 '17 at 17:12
When the operation is as simple as that, compiler is often able to generate equal (or better) avx code for the loop. The optimized code may look to be bloated, if it must be able to handle all vector sizes, aligned to 16/32 or not. In the end, this operation will be memory bound. — Aki Suihkonen, Aug 20 '17 at 04:39
Now that the question has been edited, I don't understand what you're asking anymore. Are you just asking how it's possible that *more* code could run *faster*? That's not so strange. It's actually pretty common. Here, you see that the compiler has unrolled the loop—a common speed-over-size optimization. Clearly the operation is memory-bound, as opposed to computationally-bound, so SIMD is of little benefit. That's no particular surprise, as the only operation being done here in the inner loop is a single multiplication by a constant. — Cody Gray - on strike, Aug 20 '17 at 10:41
Thanks @AkiSuihkonen, i did not consider the operation is memory bound and not CPU bound. :D — Amanda Osvaldo, Aug 20 '17 at 16:55
Thanks very much @CodyGray, I thought that a constant multiplication could be computationally intense. — Amanda Osvaldo, Aug 20 '17 at 16:58
Even so, the AVX code is average 10 ms slower than Serial Compiler Optimized. — Amanda Osvaldo, Aug 20 '17 at 17:50
Multiplication by a constant is a single operation. Runs in <5 cycles on a modern processor, usually closer to 2 or 3. The AVX code has more overhead, getting the values set up in the AVX registers, so it makes sense that it would be slightly slower. As mentioned above, the operation is memory-bound, not CPU bound, so parallelizing it does you little good as far as performance goes. — Cody Gray - on strike, Aug 21 '17 at 03:07
@CodyGray, I did not take this into account, I think I should calculate opcodes latency and throughput, and CPU bandwidth to be able to consider the best approach for writing a high-performance computation. — Amanda Osvaldo, Aug 22 '17 at 13:55
@CodyGray: The compiler fully unrolled the unaligned-handling startup / cleanup, but there is only one `vmulps` instruction in each version. Also note that latency is not relevant, only throughput, because each iteration is independent. — Peter Cordes, Aug 22 '17 at 19:45

Peter Cordes · Accepted Answer · 2017-08-22T23:37:52.550

Multiplying by 5 is so trivial that you should do that on the fly next time you read the array, or fold it into the code that wrote this array. Loading all that data from RAM into the CPU and storing it back again just to multiply by 5.0 is not efficient.

If you can't just fold it into a different pass of your algorithm, try cache-blocking aka loop-tiling to run multiple steps of your algorithm over a part of this array that fits into cache, before moving on to the next cache-sized block.

Your scalar code auto-vectorizes to nearly the same inner loop as your manually-vectorized version. Neither one is unrolled at all.

The extra code size in gcc's version is just scalar startup / cleanup so its inner loop can use aligned loads/stores. gcc fully unrolls those loops.

Also note that your manually-vectorized code doesn't handle the case where SIZE is not a multiple of 8. (gcc does handle the cleanup at the end even then, because it doesn't know where the alignment boundary will be.)

clang usually just uses unaligned loads/stores on arrays that it can't prove at compile time are always aligned. gcc's default behaviour is maybe good for large arrays that actually are misaligned at run-time, but a total waste of I-cache and branches for cases where the data is in fact aligned at run time most of the time, or for small arrays where doing a bunch of branching and scalar iterations isn't worth it.

The inner loops are nearly the same. In your manually vectorized version, gcc managed to optimize away the element-by-element copy through f_data and emit what you would get from _mm256_loadu_ps(&vector[loop]), instead of actually copying to a local and then doing a vector load. And same for storing back into vector[], luckily for you.

  # top of inner loop in the manually-vectorized version:
  400e00:   c5 f4 59 00             vmulps ymm0,ymm1,YMMWORD PTR [rax]
  400e04:   48 83 c0 20             add    rax,0x20
  400e08:   c5 fc 11 40 e0          vmovups YMMWORD PTR [rax-0x20],ymm0
  400e0d:   48 39 c2                cmp    rdx,rax
  400e10:   75 ee                   jne    400e00 <_Z3AVXv+0x20>

gcc's inner loop uses a loop counter separate from the pointer, so it has an extra instruction, and it uses an indexed addressing mode. vmulps ymm0,ymm1,YMMWORD PTR [rcx+rax*1] can't stay micro-fused on Haswell, so it will issue as 2 fused-domain uops.

  # top of gcc's inner loop:
  400f50:   c5 f4 59 04 01          vmulps ymm0,ymm1,YMMWORD PTR [rcx+rax*1]
  400f55:   48 83 c2 01             add    rdx,0x1
  400f59:   c5 fc 29 04 01          vmovaps YMMWORD PTR [rcx+rax*1],ymm0
  400f5e:   48 83 c0 20             add    rax,0x20
  400f62:   48 39 d7                cmp    rdi,rdx
  400f65:   77 e9                   ja     400f50 <_Z6Serialv+0xd0>

The extra add instruction is another extra uop. This is 6 fused-domain uops (and thus can run at best one iteration per 1.5 cycles, bottlenecked on the front-end).

Your manual version is only 4 fused-domain uops, so it can issue at 1 per clock. It can in theory run that fast if the buffer is hot in L1D cache (or maybe L2), also limited by 1 store per clock.

Of course, since you're running it over a giant buffer, you just bottleneck on memory bandwidth. The minor front-end bottleneck in the auto-vectorized version is a total non-issue. Even an SSE2 version would barely run slower.

You said something about a Xeon with 16 cores. If you want gcc to auto-parallelize as well as SIMD vectorize, you could use OpenMP. As it is, your code is purely single-threaded.

Thanks Very Much, @Peter Corde, your answer put me to think in very things and give me many details that is unknown for me. Really, thanks a lot. And about the Intel Xeon, I'm talking about the AVX 512, avaliable into Xeon Phi and Xeon Skylake microarchicture. — Amanda Osvaldo, Aug 23 '17 at 00:19

Why is this AVX code slower?

1 Answers1