I wrote a simple for loop that assigns a constant to every element of an array:

#include <iostream>
#include <vector>
#include <cstdlib>

#include <omp.h>

using namespace std;
int nr_threads = 1;
long J = 10000000;
long K = 40;

int main(int argc, char* argv[])
{
    if (argc > 1)
        nr_threads = atoi(argv[1]);

    // J*K = 4e8 doubles = 3.2 GB, far larger than any CPU cache.
    vector<double> H_U_d(J*K, 1);

    double start_time = omp_get_wtime();
// Fill the array in parallel; static scheduling gives each thread one
// contiguous chunk.
#pragma omp parallel for num_threads(nr_threads) schedule(static)
    for(long j = 0; j < J*K; j++)
    {
        H_U_d[j] = 1;
    }
    cout << omp_get_wtime()-start_time << endl;
    return 0;
}

I compile it with gcc, g++ main.cpp -o test_speedup -fopenmp, and test it on a 12-core machine. My system is Ubuntu 14.04.3 and the CPU is an Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz with 128GB RAM. With no optimization, I get this result:

➜ ~ ./test_speedup 1

2.95739

➜ ~ ./test_speedup 8

0.483756

The speedup is around 6.

However, if I use -O3 to optimize it (g++ main.cpp -o test_speedup -fopenmp -O3), the result is:

➜ ~ ./test_speedup 1

0.379158

➜ ~ ./test_speedup 8

0.265842

The speedup is poor.

How does gcc optimize the loop, and is there any way to avoid this?

user1221244
  • OpenMP spends some time creating the threads. Try a test with a longer run time. – Yuriy Orlov Feb 23 '16 at 15:20
  • The optimized single-threaded loop is already close to optimal, and may use SIMD instructions to speed it up. Adding threads simply doesn't help much when the code is already optimal. I recommend you check the assembler code and compare the unoptimized and the optimized versions at that level. – Some programmer dude Feb 23 '16 at 15:20
  • If you want to understand what the compiler is doing, you can start by looking at the assembly generated: `gcc -S` on the source or `objdump -d` on the binary (see the command sketch after these comments). – Kurt Stutsman Feb 23 '16 at 15:22
  • Duplicate of http://stackoverflow.com/q/11576670/5239503 ? – Gilles Feb 23 '16 at 18:52
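
Following up on those comments, here is one way to inspect what the optimizer did (a sketch; -fopt-info-vec is assumed from newer GCC releases and the exact flag may differ on the GCC shipped with Ubuntu 14.04):

# Ask GCC to report which loops it vectorized:
g++ -O3 -fopenmp -fopt-info-vec main.cpp -o test_speedup

# Or inspect the generated assembly directly:
g++ -O3 -fopenmp -S main.cpp -o main.s
objdump -d test_speedup | less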

1 Answer

Your vector H_U_d will not fit into the cache of your processor, so your performance is likely limited by main memory bandwidth. Once the worker threads saturate the memory bandwidth, additional threads simply have to wait for memory.
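
As a rough sanity check (a sketch based on the -O3 timings in the question; the 2x read-for-ownership factor assumes the compiler did not emit non-temporal stores), you can turn those timings into an effective bandwidth:

#include <iostream>

int main()
{
    const double bytes = 4e8 * 8;              // J*K doubles = 3.2 GB
    const double t1 = 0.379158, t8 = 0.265842; // -O3 timings from the question
    // An ordinary store first pulls each cache line into the cache
    // (read-for-ownership), so DRAM traffic is roughly twice the array size.
    std::cout << "1 thread:  " << 2 * bytes / t1 / 1e9 << " GB/s traffic\n";
    std::cout << "8 threads: " << 2 * bytes / t8 / 1e9 << " GB/s traffic\n";
    return 0;
}

That works out to roughly 17 GB/s and 24 GB/s of memory traffic. The E5-2620's four DDR3-1333 channels top out at 42.6 GB/s theoretical per socket, and sustained write bandwidth is typically well below that, so a handful of threads is plausibly enough to saturate it.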

You might also encounter some NUMA effects if you are running on a multi-socket machine. A more specific answer would require more information on your system (processors, memory, OS).
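If NUMA is in play (a 12-core machine with this 6-core CPU is likely two sockets), a common mitigation is first-touch placement: allocate without initializing and let each thread touch its own chunk first, so pages land on that thread's local memory node. A minimal sketch; note that std::vector's constructor in the question already touches every page on the main thread, which defeats first touch:

#include <cstdlib>
#include <omp.h>

int main()
{
    const long N = 10000000L * 40;
    // Uninitialized allocation: no page is touched yet.
    double* a = static_cast<double*>(std::malloc(N * sizeof(double)));

    // First touch in parallel with the same static schedule as the timed
    // loop, so each page is mapped on the node of the thread that uses it.
    #pragma omp parallel for schedule(static)
    for (long j = 0; j < N; j++)
        a[j] = 1;

    std::free(a);
    return 0;
}

Running the original binary under numactl --interleave=all is another quick way to check whether page placement matters.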

Zulan
  • Thanks for your advice, I have added my system information. – user1221244 Feb 23 '16 at 15:38
  • @user1221244 Do you know the specific memory configuration (frequency, dual/quad-channel)? Please note that this is a 6-core CPU (12 hardware threads), so 8 threads would be a particularly bad configuration. Try 1, 2, 3, 4, 5, 6, 12 (a shell sweep sketch follows below). – Zulan Feb 23 '16 at 16:02
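
To run that sweep in one go (a hypothetical shell one-liner, assuming the test_speedup binary from the question):

for t in 1 2 3 4 5 6 12; do ./test_speedup $t; done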