0

I wrote a program to calculate the sum of an array of 1M numbers where all elements equal 1. I use OpenMP for multithreading. However, the run time doesn't scale with the number of threads. Here is the code:

#include <iostream>
#include <omp.h>
#define SIZE 1000000
#define N_THREADS 4
using namespace std;



int main() {

    int* arr = new int[SIZE];
    long long sum = 0;
    int n_threads = 0;

    omp_set_num_threads(N_THREADS);
    double t1 = omp_get_wtime();

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            n_threads = omp_get_num_threads();
        }

        #pragma omp for schedule(static, 16)
        for (int i = 0; i < SIZE; i++) {
            arr[i] = 1;
        }

        #pragma omp for schedule(static, 16) reduction(+:sum)
        for (int i = 0; i < SIZE; i++) {
            sum += arr[i];
        }

    }

    double t2 = omp_get_wtime();

    cout << "n_threads " << n_threads << endl;
    cout << "time " << (t2 - t1)*1000 << endl;
    cout << sum << endl;

    delete[] arr;
    return 0;
}

The run time (in milliseconds) for different values of N_THREADS is as follows:

n_threads 1
time 3.6718

n_threads 2
time 2.5308

n_threads 3
time 3.4383

n_threads 4
time 3.7427

n_threads 5
time 2.4621

I used schedule(static, 16) so that each thread works on chunks of 16 iterations, to avoid the false-sharing problem. I thought the performance issue was related to false sharing, but I now think it's not. What could possibly be the problem?

  • Did you enable OpenMP? What command line did you use to build your program? What is your operating system? Please read [this](https://stackoverflow.com/questions/60291987/idiomatic-way-of-performance-evaluation/60293070#60293070). – Jérôme Richard Oct 22 '21 at 18:15
  • Like Jérôme Richard, I think the problem is in the measurements. The measurement imprecision is probably (much) bigger than the actual work. Try doing the same work 1000 times, for example. And read the linked answer. – prapin Oct 22 '21 at 18:29
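prapin's suggestion of repeating the work and averaging could be sketched like this (a minimal sketch, not the asker's code: the function names are illustrative, and timing is done with std::chrono so the snippet also builds without -fopenmp, in which case the pragmas are simply ignored):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// The question's kernel: fill the array with 1s, then sum it,
// both loops inside one parallel region.
long long fill_and_sum(std::vector<int>& arr) {
    long long sum = 0;
    const int n = (int)arr.size();
    #pragma omp parallel
    {
        #pragma omp for schedule(static, 16)
        for (int i = 0; i < n; i++) arr[i] = 1;

        #pragma omp for schedule(static, 16) reduction(+:sum)
        for (int i = 0; i < n; i++) sum += arr[i];
    }
    return sum;
}

// Run the kernel `repeats` times and report the average per-run time
// in milliseconds; the early runs double as cache/frequency warm-up.
double avg_time_ms(std::vector<int>& arr, int repeats) {
    auto t1 = std::chrono::steady_clock::now();
    long long checksum = 0;
    for (int r = 0; r < repeats; r++)
        checksum += fill_and_sum(arr);
    auto t2 = std::chrono::steady_clock::now();
    std::printf("checksum %lld\n", checksum);   // keep the work observable
    return std::chrono::duration<double, std::milli>(t2 - t1).count() / repeats;
}
```

Averaging over many repeats, and discarding or at least including a few warm-up runs, makes the per-run figure far less noisy than a single 3-4 ms measurement.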

3 Answers

0

Your code is memory-bound, not compute-bound. Its speed depends on the speed of memory access (cache utilization, number of memory channels, etc.), so it is not expected to scale well with the number of threads.
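As a rough sanity check on that claim (an estimate, assuming 4-byte ints and counting one write pass plus one read pass over the array):

```cpp
// Effective bandwidth in GB/s for the question's two loops:
// one write pass plus one read pass over n 4-byte ints in `ms` milliseconds.
double effective_gbps(long long n, double ms) {
    double bytes = 2.0 * 4.0 * (double)n;   // write pass + read pass
    return bytes / (ms / 1000.0) / 1e9;
}
```

For the original single-thread run (1M ints in about 3.67 ms) this works out to roughly 2.2 GB/s, in line with the 2~3 GiB/s figure in the comments below, which is indeed well under what most modern memory systems can sustain.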

UPDATE: I ran this code with a 100x bigger SIZE (i.e. #define SIZE 100000000), compiled with g++ -fopenmp -O3 -mavx2.

Here are the results; it still scales badly with the number of threads:

n_threads 1
time 652.656
time 657.207
time 608.838
time 639.168
1000000000

n_threads 2
time 422.621
time 373.995
time 425.819
time 386.511
time 466.632
time 394.198
1000000000

n_threads 3
time 394.419
time 391.283
time 470.925
time 375.833
time 442.268
time 449.611
time 370.12
time 458.79
1000000000

n_threads 4
time 421.89
time 402.363
time 424.738
time 414.368
time 491.843
time 429.757
time 431.459
time 497.566
1000000000

n_threads 8
time 414.426
time 430.29
time 494.899
time 442.164
time 458.576
time 449.313
time 452.309
1000000000
Laci
  • I thought you were right at first glance, but a quick calculation shows that the theoretical throughput is pretty low: about 2~3 GiB/s. Most modern platforms can do much better than that ;) . Typically 5~15 times more. – Jérôme Richard Oct 22 '21 at 18:40
  • @Laci, any suggestions on how to make it run faster (as no. of threads increase)? – Abdulwahab Almestekawy Oct 22 '21 at 19:11
  • Are you sure it is the performance bottleneck in your program? – Laci Oct 22 '21 at 19:16
  • I'm not sure, but I was trying to find a way to calculate sum of N integers faster than the sequential method. In theory, it should be faster. But how can I verify this in practice? – Abdulwahab Almestekawy Oct 22 '21 at 19:22
  • I suggest using a profiler to find the hotspots of your program and focusing on those parts. Also make sure that you use the proper compiler flags for optimization. – Laci Oct 23 '21 at 06:42
0

Five threads contending for the same accumulator during the reduction, or a chunk size of only 16 iterations, may be inhibiting efficient pipelining of the loop iterations. Try a coarser region per thread.

Maybe more importantly, you need to repeat the benchmark multiple times programmatically, both to get an average and to warm the CPU caches and let the cores ramp up to higher frequencies, for a better measurement.

The benchmark results work out to about 1 MB/s. Surely even the worst RAM will do 1000 times better than that, so memory is not the bottleneck (for now). One million elements in about 4 seconds looks like locking contention or an unwarmed benchmark; normally even a Pentium 1 would manage more bandwidth than that. Are you sure you are compiling with -O3 optimization?

huseyin tugrul buyukisik
  • Do you mean increasing the chunk size? – Abdulwahab Almestekawy Oct 22 '21 at 18:40
  • 1
    Yes. Even if each thread is doing its own local accumulation, it would be cache friendly to use bigger chunks, for locality of reference/spatial locality. – huseyin tugrul buyukisik Oct 22 '21 at 18:42
  • I tried to increase the chunk size (up to 100k). Still no performance gain. Do you have any suggestions? – Abdulwahab Almestekawy Oct 22 '21 at 19:13
  • Also in benchmarking it is important to make heating cycles before benchmarking and do multiple repeats and get an average. – huseyin tugrul buyukisik Oct 22 '21 at 19:41
  • Ok. I tried something that looks stupid: I tried to keep my threads busy by using an empty for loop that does nothing. Placing such a loop in either parallel loop has shown a significant difference in execution time for different numbers of threads. What does this indicate? – Abdulwahab Almestekawy Oct 22 '21 at 20:38
  • It could be about caches. Perhaps the CPU predicted the array accesses and fetched the values before the benchmark ran. Microbenchmarking is hard, especially when you try to measure nanosecond-long functions; even the latency of rdtsc may need to be subtracted from the CPU cycle counts. – huseyin tugrul buyukisik Oct 22 '21 at 21:31
0

I have reimplemented the test as a Google Benchmark with different values:

#include <benchmark/benchmark.h>
#include <memory>
#include <omp.h>

constexpr int SCALE{32};
constexpr int ARRAY_SIZE{1000000};
constexpr int CHUNK_SIZE{16};

void original_benchmark(benchmark::State& state) 
{
    const int num_threads{static_cast<int>(state.range(0))};
    const int array_size{static_cast<int>(state.range(1))};
    const int chunk_size{static_cast<int>(state.range(2))};

    auto arr = std::make_unique<int[]>(array_size);
    long long sum = 0;
    int n_threads = 0;

    omp_set_num_threads(num_threads);
    // double t1 = omp_get_wtime();

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            n_threads = omp_get_num_threads();
        }

        #pragma omp for schedule(static, chunk_size)
        for (int i = 0; i < array_size; i++) {
            arr[i] = 1;
        }

        #pragma omp for schedule(static, chunk_size) reduction(+:sum)
        for (int i = 0; i < array_size; i++) {
            sum += arr[i];
        }
    }

    // double t2 = omp_get_wtime();

    // cout << "n_threads " << n_threads << endl;
    // cout << "time " << (t2 - t1)*1000 << endl;
    // cout << sum << endl;

    state.counters["n_threads"] = n_threads;
}

static void BM_original_benchmark(benchmark::State& state) {
    for (auto _ : state) {
        original_benchmark(state);
    }
}

BENCHMARK(BM_original_benchmark)
    ->Args({1, ARRAY_SIZE,         CHUNK_SIZE})
    ->Args({1, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({1, ARRAY_SIZE,         SCALE * CHUNK_SIZE})

    ->Args({2, ARRAY_SIZE,         CHUNK_SIZE})
    ->Args({2, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({2, ARRAY_SIZE,         SCALE * CHUNK_SIZE})

    ->Args({4, ARRAY_SIZE,         CHUNK_SIZE})
    ->Args({4, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({4, ARRAY_SIZE,         SCALE * CHUNK_SIZE})

    ->Args({8, ARRAY_SIZE,         CHUNK_SIZE})
    ->Args({8, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({8, ARRAY_SIZE,         SCALE * CHUNK_SIZE})

    ->Args({16, ARRAY_SIZE,         CHUNK_SIZE})
    ->Args({16, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({16, ARRAY_SIZE,         SCALE * CHUNK_SIZE});

BENCHMARK_MAIN();

I only have access to Compiler Explorer at the moment, which will not execute the complete suite of benchmarks. However, it looks like increasing the chunk size improves performance. Obviously, benchmark and optimize for your own system.

Daniel Dearlove