
I have the following code:

    omp_set_num_threads(10);
    
    int N = 10000;
    int chunk = 1000;
    int i;
    float a[N];

    for (i=0; i < N; i++) 
    {
            a[i] = i;
    }

    #pragma omp parallel private(i)
    {
            #pragma omp for schedule(dynamic,chunk) nowait
            for (i=0; i < N; i++) a[i] = a[i] + a[i];
    } 

The sequential code is the same, without the two pragma directives.

sequential: ~150 µs, parallel: ~1100 µs

That's a huge gap, and I expected it to be the other way around.

Does anyone have a clue what's wrong, or does OpenMP really have that much to do in the background? Does anyone have an example where the parallelized for loop is actually faster?

Thanks

gab
    Parallel execution is usually best when the cost of doing one thing is not outweighed by the cost of scheduling things across multiple execution engines and having to coordinate everything. The cost of doubling 10,000 items is pretty low. – paxdiablo Aug 28 '20 at 04:54
  • Thanks, that was the problem! Is there a way to preallocate the threads? – gab Aug 28 '20 at 05:11
  • 1
    @gab: The threads have to be actually launched. They don't spring into existence on their own, and there's no way to make them exist for free through "preallocation". OpenMP hides the complexity of doing so from you, but the work still has to be done. – ShadowRanger Aug 28 '20 at 05:15
  • Since all sane OpenMP runtimes keep a thread pool, the cost of creating the threads is only paid once, no matter how many parallel regions you have. Therefore a better way to measure the cost you'd see in real code that uses OpenMP in many places is to have an untimed, empty parallel region to start the threads before you time the parallel region you are interested in. (And be careful which clock you use. omp_get_wtime() is OK; CPU consumption is not!) – Jim Cownie Aug 28 '20 at 07:35

1 Answer


Launching ten threads and managing them is more expensive than adding 10,000 elements. If you're on a system with fewer than 10 true cores, they're going to compete for time slices, incurring context switches and (potentially) extra cache misses. On top of that, you chose dynamic scheduling, which requires some synchronization so the threads can figure out which chunks to take next (not a lot, but enough to slow things down when the work being distributed is trivial by comparison).

In one anecdote, launching a no-op thread and immediately detaching it cost about 10 µs per thread, and those threads didn't need to do anything. In your case, that's 100 µs just to launch the threads, before counting any of the other overhead threading introduces.

Parallelizing helps for big workloads, or when the workers are reused across many moderate-sized tasks (so the cost of launching the threads is a small fraction of the work they do). But performing 10,000 additions is chump change to a CPU; you're simply not doing enough work to benefit from parallelizing it.

ShadowRanger