
I am trying to see the performance speedup of AVX instructions. Below is the example code I am running:

#include <iostream>
#include <stdio.h>
#include <string.h>
#include <cstdlib>
#include <algorithm>
#include <immintrin.h>
#include <chrono>
#include <complex>
//using Type = std::complex<double>;
using Type = double;

int main()
{
    size_t b_size = 1;
    b_size = (1ul << 30) * b_size;
    Type *d_ptr = (Type*)malloc(sizeof(Type)*b_size);
    for (int i = 0; i < b_size; i++)
    {
        d_ptr[i] = 0;
    }
    std::cout << "malloc finishes!" << std::endl;
#ifndef AVX512
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i++)
    {
        d_ptr[i] = i*0.1;
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "No avx takes " << diff << std::endl;
#else
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i += 4)
    {
        /* __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
           __m128d tmp2 = _mm_set_pd((i+1)*0.1,0.1*i);
           __m128d tmp3 = _mm_add_pd(tmp1,tmp2);
           _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        __m256d tmp2 = _mm256_set_pd(0.1*(i+3),0.1*(i+2),0.1*(i+1),0.1*i);
        __m256d tmp3 = _mm256_add_pd(tmp1,tmp2);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp3);
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif
}

I have tested this code on both Haswell and Cascade Lake machines; the cases without and with AVX produce quite similar execution times.

---Edit--- Here are the compiler commands I used:

Without AVX: g++ test_avx512_performance.cpp -march=native -o test_avx512_performance_noavx

With AVX: g++ test_avx512_performance.cpp -march=native -DAVX512 -o test_avx512_performance

---Edit Again--- I have run the above code on the Haswell machine again. The results are surprising:

Without AVX and compiled with O3:

~$ ./test_avx512_auto_noavx 
malloc finishes!
1.07374e+08
No avx takes 3824740

With AVX and compiled without any optimization flags:

~$ ./test_avx512_auto
malloc finishes!
1.07374e+08
avx takes 2121917

With AVX and compiled with O3:

~$ ./test_avx512_auto_o3 
malloc finishes!
1.07374e+08
avx takes 6307190

This is the opposite of what we expected.

Also, I have implemented a vectorized version (similar to Add+Mul become slower with Intrinsics - where am I wrong?); see the code below:

#else
    auto a = std::chrono::high_resolution_clock::now();
    __m256d tmp2 = _mm256_set1_pd(0.1);
    __m256d base = _mm256_set_pd(-1.0,-2.0,-3.0,-4.0);
    __m256d tmp3 = _mm256_set1_pd(4.0);
    for (int i = 0; i < b_size; i += 4)
    {
        /* __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
           __m128d tmp2 = _mm_set_pd((i+1)*0.1,0.1*i);
           __m128d tmp3 = _mm_add_pd(tmp1,tmp2);
           _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        base = _mm256_add_pd(base,tmp3);
        __m256d tmp5 = _mm256_mul_pd(base,tmp2);
        tmp1 = _mm256_add_pd(tmp1,tmp5);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]),tmp1);
    }
    std::cout << d_ptr[b_size-1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b-a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif

On the same machine, this gives me:

With AVX and without any optimization flags:

~$ ./test_avx512_manual 
malloc finishes!
1.07374e+08
avx takes 2151390

With AVX and with O3:

~$ ./test_avx512_manual_o3 
malloc finishes!
1.07374e+08
avx takes 5965288

I am not sure where the problem is. Why does -O3 give worse performance?

flyree
  • have you used optimization flags in the case "without avx" ? – Antonin GAVREL Mar 09 '21 at 01:26
  • In both cases, I did not use any optimization flags. – flyree Mar 09 '21 at 01:27
  • Are you sure variables are not computed at compilation? You did not use the keyword `volatile` – Antonin GAVREL Mar 09 '21 at 01:31
  • 1. you forgot to enable optimization!! 2. you're only using AVX1 intrinsics (but fortunately you used your own macro with a misleading name, not the actual `__AVX512VL__` that Haswell wouldn't define). 3. you have `_mm256_set_pd` with non-constant scalar operands inside the inner loop, instead of actually vectorizing. If you did enable `-O3` optimization, GCC would auto-vectorize better than you're doing here. – Peter Cordes Mar 09 '21 at 02:09
  • https://godbolt.org/z/GE9Tb6 shows it auto-vectorizing (with just storing). https://godbolt.org/z/YvcErn shows your manual intrinsics that do 4x scalar int->XMM before shuffling together for a SIMD multiply. (GCC could have done better here, but you want `_mm_add_epi32` to increment a vector of counters). Also, you're adding into the destination but the scalar code is just storing, `=` not `+=`. – Peter Cordes Mar 09 '21 at 02:15
  • Hi Peter, yes I know that I only used AVX1 intrinsics in this example. My original problem is written with AVX512 intrinsics, but I did not see any speedups, so I thought I would try to use AVX1 to see how things go in the basic scenarios. – flyree Mar 09 '21 at 02:16
  • Also duplicate of [Add+Mul become slower with Intrinsics - where am I wrong?](https://stackoverflow.com/q/53834135) for the poor use of _mm_set_pd instead of vectorized index calculations. (Or even do the increment on an FP vector, if you unroll to hide latency.) Actually a really similar duplicate; filling an array with a linear function of an integer counter. So a lot of my answer there actually applies. – Peter Cordes Mar 09 '21 at 02:19
  • @PeterCordes Thank you for the help. I am reading them now. – flyree Mar 09 '21 at 02:20
  • Anyway, no, you normally wouldn't expect speedups from things that are easy enough for `gcc -O3` to auto-vectorize. And you also shouldn't expect speedups with the default `-O0` - it takes more statements and more inlined function calls to do stuff with intrinsics, which leads to worse debug-mode asm. – Peter Cordes Mar 09 '21 at 02:20
  • @PeterCordes Thank you for your detailed suggestions. However, I have rerun the code and did some modifications, and things are strange. Please see my updated question. – flyree Mar 09 '21 at 20:57
  • Your scalar version is still doing less work than your manual-vec version (only storing, not RMW). Also, note that GCC optimizes your malloc + zeroing-loop into `calloc` so you have page faults inside your timed region. The initial read may result in extra page faults, if the read leaves it copy-on-write mapped to a shared page of zeros, and has to fault again when you write. So try fixing your SIMD version to match your scalar, or vice versa (`+=`), and init the memory with something non-zero like 0.1 before the timed region. [See this answer](//stackoverflow.com/q/60291987). – Peter Cordes Mar 10 '21 at 03:43
  • Also, `base = _mm256_add_pd(base,tmp3);` creates a loop-carried dependency chain that's 3 cycles long on Haswell, 4 on Skylake. You didn't unroll at all, and GCC won't do it for you (and couldn't with strict FP math). As explained in [Add+Mul become slower with Intrinsics - where am I wrong?](https://stackoverflow.com/q/53834135), unrolling to hide that latency is key, like add `_mm256_set1_pd(8.0)` to two separate vectors to keep them going in strides. Auto-vectorization (https://godbolt.org/z/GEsYb5) using int->double conversion every iteration costs more insns so you can beat it. – Peter Cordes Mar 10 '21 at 03:45
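Following the last two comments, here is a minimal sketch of the suggested changes: a store-only loop (to match the scalar `=` version rather than the load+add+store of the manual version), a nonzero warm-up of the array before any timed region as the comments recommend, and two independent base vectors stepped by `_mm256_set1_pd(8.0)` so their loop-carried `vaddpd` chains can overlap. This is an illustration of the comments, not code from the original post; the variable names and the omission of the timing code are assumptions.

#include <immintrin.h>
#include <cstdlib>
#include <cstdio>

int main()
{
    const size_t b_size = 1ul << 30;
    double *d_ptr = (double*)malloc(sizeof(double) * b_size);

    // Touch the memory with a nonzero value first, so page faults (and GCC
    // turning a malloc + zeroing loop into calloc) stay out of any timed region.
    for (size_t i = 0; i < b_size; i++)
        d_ptr[i] = 0.1;

    __m256d step  = _mm256_set1_pd(8.0);                // each chain advances by 8 indices per iteration
    __m256d scale = _mm256_set1_pd(0.1);
    __m256d base0 = _mm256_set_pd(3.0, 2.0, 1.0, 0.0);  // lanes for i+0 .. i+3
    __m256d base1 = _mm256_set_pd(7.0, 6.0, 5.0, 4.0);  // lanes for i+4 .. i+7

    for (size_t i = 0; i < b_size; i += 8)
    {
        // Store only (=), like the scalar loop, instead of load + add + store.
        _mm256_storeu_pd(&d_ptr[i],     _mm256_mul_pd(base0, scale));
        _mm256_storeu_pd(&d_ptr[i + 4], _mm256_mul_pd(base1, scale));
        // Two independent dependency chains hide the 3-4 cycle vaddpd latency.
        base0 = _mm256_add_pd(base0, step);
        base1 = _mm256_add_pd(base1, step);
    }

    printf("%g\n", d_ptr[b_size - 1]);
    free(d_ptr);
}

Built with something like g++ -O3 -march=native, this avoids both the per-iteration `_mm256_set_pd` of non-constant scalar operands and the single latency-bound add chain of the manual version above.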

0 Answers