Why does OpenMP speed up a SINGLE-ITERATION loop?

Question

I'm using the "read" benchmark from "Why is writing to memory much slower than reading it?", and I added just two lines:

#pragma omp parallel for
for(unsigned dummy = 0; dummy < 1; ++dummy)

They should have no effect, because OpenMP should only parallelize the outer loop, which has a single iteration, but the code now consistently runs twice as fast.

Update: These lines aren't even necessary. Simply adding

omp_get_num_threads();

(implicitly declared) in the same place has the same effect.

Complete code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

unsigned long do_xor(const unsigned long* p, unsigned long n)
{
    unsigned long i, x = 0;

    for(i = 0; i < n; ++i)
        x ^= p[i];
    return x;
}

int main()
{
    unsigned long n, r, i;
    unsigned long *p;
    clock_t c0, c1;
    double elapsed;

    n = 1000 * 1000 * 1000; /* GB */
    r = 100; /* repeat */

    p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));

    c0 = clock();

#pragma omp parallel for
    for(unsigned dummy = 0; dummy < 1; ++dummy) 
    for(i = 0; i < r; ++i) {
        p[0] = do_xor(p, n / sizeof(unsigned long)); /* "use" the result */
        printf("%4ld/%4ld\r", i, r);
        fflush(stdout);
    }

    c1 = clock();

    elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;

    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);

    free(p);
}

Compiled and executed with

gcc -O3 -Wall -fopenmp single_iteration.c && time taskset -c 0 ./a.out

The wall time reported by time is 3.4s vs 7.5s.

GCC 7.3.0 (Ubuntu)

I can reproduce. It's not being optimized away, and it's not an issue with `clock`. Nevertheless, I suggest improving the question by measuring wall time and adding the actual measurement results as well as your CPU & memory specification. — Zulan, Mar 11 '19 at 10:03

score 1 · Accepted Answer · answered Mar 11 '19 at 12:38

The reason for the performance difference is not actually any difference in code, but in how memory is mapped. In the fast case you are reading from zero-pages, i.e. all virtual addresses are mapped to a single physical page, so nothing has to be read from memory. In the slow case, the memory is not mapped to the zero page and actually has to be read. For details see this answer from a slightly different context.

On the other hand, it is not caused by calling omp_get_num_threads or by the pragma itself, but merely by linking to the OpenMP runtime library. You can confirm that by using -Wl,--no-as-needed -fopenmp. If you just specify -fopenmp but don't use OpenMP at all, the linker will omit the library.

Unfortunately, I am still missing the final piece of the puzzle: why does linking to OpenMP change the behavior of calloc with regard to zeroed pages?

`-fopenmp` probably uses a different (thread-safe) version of `calloc`? — MWB, Mar 14 '19 at 16:49
