
I am sorry to post this question again, with some updates; the previous one was closed. I am trying to measure the performance speedup from AVX instructions. Below is the example code I am running:

#include <iostream>
#include <stdio.h>
#include <string.h>
#include <cstdlib>
#include <algorithm>
#include <immintrin.h>
#include <chrono>
#include <complex>

//using Type = std::complex<double>;
using Type = double;

int main()
{
    size_t b_size = 1;
    b_size = (1ul << 30) * b_size;
    Type *d_ptr = (Type*)malloc(sizeof(Type) * b_size);
    // Untimed init loop: zero the whole 8 GiB buffer before timing.
    for (size_t i = 0; i < b_size; i++)
    {
        d_ptr[i] = 0;
    }
    std::cout << "malloc finishes!" << std::endl;
#ifndef AVX512
    auto a = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < b_size; i++)
    {
        d_ptr[i] = i * 0.1;
    }
    std::cout << d_ptr[b_size - 1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    std::cout << "No avx takes " << diff << std::endl;
#else
    auto a = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < b_size; i += 4)
    {
        /* SSE2 version, kept for reference:
        __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
        __m128d tmp2 = _mm_set_pd((i+1)*0.1, 0.1*i);
        __m128d tmp3 = _mm_add_pd(tmp1, tmp2);
        _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        __m256d tmp2 = _mm256_set_pd(0.1*(i+3), 0.1*(i+2), 0.1*(i+1), 0.1*i);
        __m256d tmp3 = _mm256_add_pd(tmp1, tmp2);   // += instead of =, but the buffer is all zeros
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp3);
    }
    std::cout << d_ptr[b_size - 1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif
    free(d_ptr);
}

I have run the above code on a Haswell machine. The results are surprising:

Without AVX and compiled with O3:

~$ ./test_avx512_auto_noavx 
malloc finishes!
1.07374e+08
No avx takes 3824740

With AVX and compiled without any optimization flags:

~$ ./test_avx512_auto
malloc finishes!
1.07374e+08
avx takes 2121917

With AVX and compiled with O3:

~$ ./test_avx512_auto_o3 
malloc finishes!
1.07374e+08
avx takes 6307190

This is the opposite of what I expected.

I have also implemented a manually vectorized version (similar to Add+Mul become slower with Intrinsics - where am I wrong?); see the code below:

#else
    auto a = std::chrono::high_resolution_clock::now();
    __m256d tmp2 = _mm256_set1_pd(0.1);
    __m256d base = _mm256_set_pd(-1.0, -2.0, -3.0, -4.0);
    __m256d tmp3 = _mm256_set1_pd(4.0);
    for (size_t i = 0; i < b_size; i += 4)
    {
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        base = _mm256_add_pd(base, tmp3);          // advance the index vector: lanes become i..i+3
        __m256d tmp5 = _mm256_mul_pd(base, tmp2);  // 0.1 * {i, i+1, i+2, i+3}
        tmp1 = _mm256_add_pd(tmp1, tmp5);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp1);
    }
    std::cout << d_ptr[b_size - 1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    std::cout << "avx takes " << diff << std::endl;

#endif

On the same machine, this gives me:

With AVX and without any optimization flags:

~$ ./test_avx512_manual 
malloc finishes!
1.07374e+08
avx takes 2151390

With AVX and with O3:

~$ ./test_avx512_manual_o3 
malloc finishes!
1.07374e+08
avx takes 5965288

I am not sure where the problem is. Why does -O3 give worse performance?


Editor's note: in the executable names,

  • _avx512_ seems to mean -march=native, even though Haswell only has AVX2.
  • _manual vs. _auto seems to be -DAVX512, selecting either the manually-vectorized AVX1 code or the compiler's auto-vectorization of the scalar code, which only writes with = instead of += as the intrinsics do.
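
The question does not include the build commands; based on the note above they were presumably along these lines (the file name test_avx.cpp and the exact flag placement are assumptions, not from the original post):

~$ g++ -O3 test_avx.cpp -o test_avx512_auto_noavx                        # scalar path, no -march=native
~$ g++ -march=native test_avx.cpp -o test_avx512_auto                    # scalar path, -O0 by default
~$ g++ -march=native -O3 test_avx.cpp -o test_avx512_auto_o3
~$ g++ -march=native -DAVX512 test_avx.cpp -o test_avx512_manual         # intrinsics path, -O0
~$ g++ -march=native -O3 -DAVX512 test_avx.cpp -o test_avx512_manual_o3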
  • Without optimization, the init loop is actually touching the memory and getting the page faults done *before* the timed region. So despite the actual loop being a lot less efficient, not having to pay the cost of page faults made it an overall win. See [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) re: page faults on memory you touch for the first time. Init with something non-zero to avoid it would be the easiest thing here. – Peter Cordes Mar 10 '21 at 04:47
  • Still looking for a Q&A about GCC compiling malloc+memset(0) (or equivalent loop) into calloc; that's the key here. If you edit out the fluff that's a duplicate of [Did not get expected performance speed up](https://stackoverflow.com/q/66539433) and just focus on the part that's faster with -O0 (default) than -O3, maybe here is the right place to post that as an answer. But with the current question having auto vs. manual vectorization, and confusing `AVX512` macro which you're calling "auto" vs. "manual" separate from your -march= options I think, it's not a good place to put an answer. – Peter Cordes Mar 10 '21 at 05:07
  • Also related: [Why vectorizing the loop does not have performance improvement](https://stackoverflow.com/a/18159503) - here, gains from smarter vectorization will be hard to see because of the memory bandwidth bottleneck. If you looped repeatedly over a small array that fits in L2 cache or even L1d, you'd have room to beat the compiler's auto-vectorization. (And page faults wouldn't be dominating your run-time.) – Peter Cordes Mar 10 '21 at 05:14
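
Putting the first two comments into code: a minimal sketch (my illustration, not code from the question) of initializing the buffer with a non-zero value, so GCC cannot turn the zero-init loop into calloc and every page is faulted in before the timed region:

#include <cstdlib>
#include <cstddef>

using Type = double;

// Sketch of the suggested fix: if the init loop writes zeros, GCC can turn
// malloc + zero-fill into calloc, whose pages are still lazily allocated,
// so the *timed* loop pays all the page faults. Writing a non-zero value
// forces every page to be committed here, outside the timed region.
Type* alloc_and_prefault(size_t n)
{
    Type *p = (Type*)malloc(sizeof(Type) * n);
    for (size_t i = 0; i < n; i++)
        p[i] = 1.0;   // non-zero, so this loop can't become calloc
    return p;
}

With the buffer pre-faulted like this, the -O0 build loses its accidental head start and the -O3 numbers should reflect the loops themselves.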

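The last comment suggests a different experiment for seeing vectorization gains at all: time many passes over an array small enough to stay in L1d, so memory bandwidth stops being the bottleneck. A hedged sketch along those lines (the array size and repeat count are illustrative assumptions):

#include <chrono>
#include <cstdio>
#include <cstdlib>

using Type = double;

int main()
{
    // 2048 doubles = 16 KiB, comfortably inside a 32 KiB L1d cache.
    const size_t n = 2048;
    const int repeats = 1 << 16;   // many passes so the timing is measurable
    Type *buf = (Type*)malloc(sizeof(Type) * n);
    for (size_t i = 0; i < n; i++)
        buf[i] = 1.0;              // pre-fault the pages; also non-zero (see above)

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < repeats; r++)
        for (size_t i = 0; i < n; i++)
            buf[i] += 0.1 * i;     // compute-bound once the data is hot in L1d
    auto t1 = std::chrono::high_resolution_clock::now();

    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("%lld us, check: %f\n", us, buf[n - 1]);  // print so the loop isn't optimized away
    free(buf);
}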