
I'm trying to test the speed-up of OpenMP on an array sum program.

The elements are generated with a random number generator so the compiler cannot precompute the sum.

The array length is also set large enough to expose the performance difference.

The program is built with `g++ -fopenmp -g -O0 -o main main.cpp`; `-g -O0` are used to avoid optimization.

However, the OpenMP parallel for code is significantly slower than the sequential code.

Test result:

Your thread count is: 12
Filling arrays
filling time:66718888
Now running omp code
2thread omp time:11154095
result: 4294903886
Now running omp code
4thread omp time:10832414
result: 4294903886
Now running omp code
6thread omp time:11165054
result: 4294903886
Now running sequential code
sequential time: 3525371
result: 4294903886
#include <iostream>
#include <stdio.h>
#include <omp.h>
#include <ctime>
#include <random>

using namespace std;

long long llsum(char *vec, size_t size, int threadCount) {
    long long result = 0;
    size_t i;
#pragma omp parallel for num_threads(threadCount) reduction(+: result) schedule(guided)
    for (i = 0; i < size; ++i) {
        result += vec[i];
    }
    return result;
}

int main(int argc, char **argv) {
    int threadCount = 12;
    omp_set_num_threads(threadCount);
    cout << "Your thread count is: " << threadCount << endl;
    const size_t TEST_SIZE = 8000000000;
    char *testArray = new char[TEST_SIZE];
    std::mt19937 rng;
    rng.seed(std::random_device()());
    std::uniform_int_distribution<std::mt19937::result_type> dist6(0, 4);
    cout << "Filling arrays\n";
    auto fillingStartTime = clock();
    for (int i = 0; i < TEST_SIZE; ++i) {
        testArray[i] = dist6(rng);
    }
    auto fillingEndTime = clock();
    auto fillingTime = fillingEndTime - fillingStartTime;
    cout << "filling time:" << fillingTime << endl;

    // test omp time
    for (int i = 1; i <= 3; ++i) {
        cout << "Now running omp code\n";
        auto ompStartTime = clock();
        auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
        auto ompEndTime = clock();
        auto ompTime = ompEndTime - ompStartTime;
        cout << i * 2 << "thread omp time:" << ompTime << endl << "result: " << ompResult << endl;
    }

    // test sequential addition time
    cout << "Now running sequential code\n";
    auto seqStartTime = clock();
    long long expectedResult = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedResult += testArray[i];
    }
    auto seqEndTime = clock();
    auto seqTime = seqEndTime - seqStartTime;
    cout << "sequential time: " << seqTime << endl << "result: " << expectedResult << endl;

    delete[] testArray;
    return 0;
}
Name Null
  • C or C++? Pick **one** - C and C++ are two different languages. – Andrew Henle Oct 17 '22 at 16:29
  • Use `omp_get_wtime()` to time OpenMP programs. There are several Qs and As on this site explaining why you should. – High Performance Mark Oct 17 '22 at 16:39
  • Why would you "avoid optimization"? Please use `-O2`. – Victor Eijkhout Oct 17 '22 at 17:13
  • Also: chop a bunch of digits off those timing numbers. You need only 2 or 3 significant digits; this is unreadable. And try 2, 3, 4 threads: maybe the maximal thread count does not work well, but a lower count may. (Are you sure you have 12 cores, and not 6 cores with 2 hyperthreads each?) – Victor Eijkhout Oct 17 '22 at 17:15
  • I think `clock` does not measure what you think it does. Please do not use it, or read the documentation very carefully. – Jérôme Richard Oct 17 '22 at 17:30
  • Note that `8000000000` is bigger than the maximum value for a variable of type `int`, so you should not use `int` as a loop variable. – Laci Oct 17 '22 at 18:46
  • @VictorEijkhout Why use `-O2`? "Avoid optimization" means I do not want some weird compiler optimization to accelerate serial addition, so that I can better see the speed-up of OpenMP parallel execution. – Name Null Oct 18 '22 at 07:22

1 Answer


As pointed out by @High Performance Mark, I should use `omp_get_wtime()` instead of `clock()`.

`clock()` measures active processor (CPU) time accumulated across all threads, not elapsed wall-clock time, so a parallel region that keeps more cores busy reports a larger `clock()` value even when it finishes sooner.

See

  1. OpenMP time and clock() give two different results
  2. https://en.cppreference.com/w/c/chrono/clock

After switching to `omp_get_wtime()` and changing the loop counters from `int i` to `size_t i` (so they can represent `TEST_SIZE`), the results are more meaningful:

Your thread count is: 12
Filling arrays
filling time:267.038
Now running omp code
2thread omp time:26.1421
result: 15999820788
Now running omp code
4thread omp time:7.16911
result: 15999820788
Now running omp code
6thread omp time:5.66505
result: 15999820788
Now running sequential code
sequential time: 30.4056
result: 15999820788
  • Which processor do you use? What do you get if you remove the `schedule(guided)` clause and use `-O3` and `-mavx2` (or `-mavx512`) flags? On my computer (Ryzen 7 5800U, g++ 11.2), the measured speed-up (2 threads vs. sequential) is about 1.95-1.98 in this case (very close to the theoretical maximum, which suggests that turbo boost is not so efficient in my case). – Laci Oct 18 '22 at 08:38
  • @Laci Using `-O3` instead of `-g -O0` and deleting `schedule(guided)`, the 2-thread speed-up is very limited (10-20%); the 4-thread and 6-thread speed-ups are significant, but below the theoretical values (4x or 6x). My setup is `g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0`, i5-10400, Ubuntu 20.04 on a VM. – Name Null Oct 18 '22 at 15:33