
I have edited my question after the previous comments (especially @Zboson's) for better readability.

I have always acted on, and observed, the conventional wisdom that the number of OpenMP threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with an Intel Core i7 4960HQ, 4 cores / 8 threads (see the Intel docs).

Here is my test code:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 256*8192*100;
    double *A, *B;
    posix_memalign((void**)&A, 64, n*sizeof(double));
    posix_memalign((void**)&B, 64, n*sizeof(double));
    for (int i = 0; i < n; ++i) {
        A[i] = 0.1;
        B[i] = 0.0;
    }
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        B[i] = exp(A[i]) + sin(B[i]);
    }
    double end = omp_get_wtime();
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += B[i];
    }
    printf("%g %g\n", end - start, sum);
    return 0;
}

When I compile it using gcc 4.9 (4.9-20140209) with the command gcc -Ofast -march=native -std=c99 -fopenmp -Wa,-q, I see the following performance as I change OMP_NUM_THREADS [each point is the average of 5 runs; the error bars (which are hardly visible) are the standard deviations]:

[Plot: performance as a function of thread count]

The plot is clearer when shown as the speed-up with respect to OMP_NUM_THREADS=1:

[Plot: speed-up as a function of thread count]

The performance increases more or less monotonically with thread count, even when the number of OpenMP threads greatly exceeds the core and hyper-thread count! Usually the performance should drop off when too many threads are used (at least in my previous experience), due to the threading overhead, especially as the calculation should be CPU (or at least memory) bound and not waiting on I/O.

Even more weirdly, the speed-up is 35 times!

Can anyone explain this?

I also tested this with a much smaller array (8192*4 elements) and see similar performance scaling.

In case it matters, I am on Mac OS 10.9 and the performance data were obtained by running (under bash):

for i in {1..128}; do
    export OMP_NUM_THREADS=$i
    for k in {1..5}; do
        echo -n "$i $k "
        ./a.out
    done
done > out

EDIT: Out of curiosity I decided to try much larger numbers of threads. My OS limits this to 2000. The odd results (both the speed-up and the low threading overhead) speak for themselves!

[Plot: crazy numbers of threads]

EDIT: I tried @Zboson's latest suggestion in their answer, i.e. putting VZEROUPPER before each math function within the loop, and it did fix the scaling problem! (It also took the single-threaded code from 22 s to 2 s!):

[Plot: correct scaling]

jtravs
  • It may indeed be how OpenMP is assigning the threads; have you tried 3 threads just out of curiosity? It could be that when moving from 1 to 2, it is assigning both threads to a single ACTUAL core, but because you are truly trying to utilize the same resources within that single core, it really isn't helping! When moving to 4, you are truly utilizing 2 actual cores (maybe). Also, what happens if you use 8 threads, so we can see what happens when we move from (hopefully) a hyperthread situation to a full-core situation + hyperthreads? – trumpetlicks Feb 22 '14 at 20:44
  • @trumpetlicks I added the timings you wanted. – jtravs Feb 22 '14 at 20:57
  • Also, if you do multiple runs of each (with the exception of the single case), what do the timings come out to? I think that OpenMP and the OS randomly assign to a core # (or in your case it could be assigning to an HT or an actual core). – trumpetlicks Feb 22 '14 at 21:11
  • Where are you changing the no. of threads used? – Devavrata Feb 22 '14 at 21:21
  • @Neuron by using the OMP_NUM_THREADS environment variable – jtravs Feb 22 '14 at 21:52
  • For accurate benchmarking you want to run each benchmark at least 20 times, there are a few whitepapers about that around. On which OS are you running this? I assume OpenMP leaves the scheduling up to the OS and if that one's not HT aware that would easily explain the behavior. (Also since gcc most likely correctly unrolls and vectorizes the code that explains why you don't see any significant difference between 4 and 8 threads - HT can even be detrimental in such a situation; strange thing about 8->16 though). – Voo Feb 23 '14 at 01:22
  • Don't use clock(). It does not return the wall time on Linux (but it does on Windows). Use `omp_get_wtime()`. – Z boson Feb 23 '14 at 07:22
  • @Voo gcc won't vectorize anything using libc functions. The issue is indeed OS thread scheduling strategy. – Joel Falcou Feb 23 '14 at 07:58
  • @Joel `sin` and most other math functions use compiler builtins for performance reasons, haven't tested for `exp` but I assume same thing there. If we really get a call overhead for every tiny function here, then I'd expect HT to profit the code which doesn't seem to be the case. – Voo Feb 23 '14 at 10:18
  • Compiler builtins are one thing, vectorized versions of those are another. – Joel Falcou Feb 23 '14 at 10:36
  • @Voo do you have a reference? Usually 5 times is sufficient for a crude test of the variation in most statistical systems. – jtravs Feb 23 '14 at 10:40
  • @jtravs, I'm not really surprised you see some boost for threads 5-8 but it's perhaps more than I would expect. The math libraries are not optimized and probably have a lot of CPU stalls, which is where HT helps. But I don't understand why you see an improvement past 8 threads. I'll test this code tomorrow. BTW, nice plots! It's refreshing to see plots with axis labels and even error bars. – Z boson Feb 23 '14 at 13:01
  • @jtravs, your first plot of the average runtime would be better as a log plot (log of the average run time). – Z boson Feb 23 '14 at 19:53
  • @jtravs, I added some new information to my answer that I think you should consider. Basically, I think you should call `_mm256_zeroupper()` before the `exp` and `sin` functions. – Z boson Feb 23 '14 at 20:55
  • @jtravs, just to warn you, because nobody told me before I learned the hard way, in SO if you make more than 10 edits to a post it becomes community wiki. That means you no longer get credit for up votes and other people can edit your post easier. It's rather stupid because you're making your question better but that's the way it is. – Z boson Feb 23 '14 at 21:08

1 Answer


The problem is likely due to the clock() function. It does not return the wall time on Linux. You should use the function omp_get_wtime() instead. It's more accurate than clock() and works on GCC, ICC, and MSVC. In fact, I use it for timing code even when I'm not using OpenMP.
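
To illustrate the difference, here is a minimal sketch (not part of the original post) comparing the two timers: on Linux, clock() returns CPU time summed over all busy threads, so a well-scaling parallel loop looks no faster by that measure, whereas omp_get_wtime() reports wall-clock time:

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main(void) {
    clock_t c0 = clock();          // CPU time, summed over all threads on Linux
    double  w0 = omp_get_wtime();  // wall-clock time

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= 200000000L; ++i) {
        sum += 1.0 / (double)i;
    }

    clock_t c1 = clock();
    double  w1 = omp_get_wtime();

    // With N busy threads the clock() figure is roughly N times the wall time.
    printf("clock():         %.3f s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    printf("omp_get_wtime(): %.3f s  (sum = %g)\n", w1 - w0, sum);
    return 0;
}

Compiled with gcc -O2 -fopenmp, the wall time should drop as threads are added while the clock() figure stays roughly constant.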

I tested your code with it here http://coliru.stacked-crooked.com/a/26f4e8c9fdae5cc2

Edit: Another thing to consider, which may be causing your problem, is that the exp and sin functions you are using were compiled WITHOUT AVX support, while your code is compiled WITH AVX support (actually AVX2). You can see this on GCC explorer with your code if you compile with -fopenmp -mavx2 -mfma. Whenever you call a function without AVX support from code with AVX support, you need to zero the upper part of the YMM registers or pay a large penalty. You can do this with the intrinsic _mm256_zeroupper (VZEROUPPER). Clang does this for you, but last I checked GCC does not, so you have to do it yourself (see the comments to the question Math functions takes more cycles after running any intel AVX function and also the answer here Using AVX CPU instructions: Poor performance without "/arch:AVX"). So on every iteration you have a large delay due to not calling VZEROUPPER. I'm not sure why this matters with multiple threads, but if GCC does this each time it starts a new thread then it could help explain what you are seeing.

#include <immintrin.h>

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    _mm256_zeroupper();   // zero the upper YMM halves before the non-AVX sin
    B[i] = sin(B[i]);
    _mm256_zeroupper();   // ...and again before the non-AVX exp
    B[i] += exp(A[i]);
}

Edit: A simpler way to test this is, instead of compiling with -march=native, to not set the arch at all (gcc -Ofast -std=c99 -fopenmp -Wa,-q) or to use just SSE2 (gcc -Ofast -msse2 -std=c99 -fopenmp -Wa,-q).

Edit: GCC 4.8 has an option -mvzeroupper which may be the most convenient solution.

This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.
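
As an illustration (not from the original answer), adding that flag to the question's compile line would look like the following; the file name test.c is just a placeholder:

gcc -Ofast -march=native -mvzeroupper -std=c99 -fopenmp -Wa,-q test.c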

Z boson
  • Time what you have to time. Warming up just makes sure you forget to take into account the cost of OpenMP, which is misleading. The cost is the cost, live with it. – Joel Falcou Feb 23 '14 at 07:57
  • I could argue that not warming up is misleading. If you're going to use your function several times and you only report the time starting cold then that's misleading. It's best to report a worst-case and a best-case time. That's more accurate. – Z boson Feb 23 '14 at 08:02
  • @JoelFalcou, to give you an example. I render the Mandelbrot set several frames per second using OpenMP. The first frame is always the slowest one due to OpenMP warming up. It's not just a question of the cache because I can change what I render (zoom, translate) and go back to the initial setting and it's only the first frame which is so slow. If I only reported the time for the first frame it would be misleading. In this case the best case time is more accurate. – Z boson Feb 23 '14 at 08:15
  • Usually the best way to do that is to run a *large* number of samples and then take the median or the first-decile values. Also, cache issues are nonexistent in Mandelbrot anyway as you only store values to your destination buffer. So yeah, the first frame is slow because of the threads starting up + the cache being cold. The median time is better for that as it removes all outliers, not only the first. – Joel Falcou Feb 23 '14 at 10:38
  • @Zboson I only wanted to parallelize one loop as I was comparing the same kernel calculation over many different languages/systems. For the same reason I want to include all openmp overhead. – jtravs Feb 23 '14 at 10:42
  • @Zboson your comment about not using `clock()` was spot on (I actually had known that before, but temporarily forgot). However, the odd behaviour with num_threads >> num_real_hardware_threads is still unexplained. – jtravs Feb 23 '14 at 10:56
  • @Zboson I tried your new suggestion, and it does seem to fix the cpu scaling issue, but it gives the wrong numerical answer! And takes twice as long to run as the previous version. See my edit to the question. – jtravs Feb 23 '14 at 21:10
  • @jtravs, can you try compiling without AVX. I mean instead of using `-march=native` not set the arch or just use `-msse2` instead and see if the scaling issue goes away? – Z boson Feb 23 '14 at 21:15
  • Actually it was just an ordering error, we need to switch the two lines (as B[i] is currently assigned in the first one). If you edit your answer I'll mark it as correct. – jtravs Feb 23 '14 at 21:17
  • @jtravs, fixed it. Wow! This was an interesting problem! I had to think of so many things to figure this one out. Now I can say that the people telling me I'm wasting my time learning intrinsics and SIMD and just to learn OpenMP are wrong. I doubt they would have figured this one out :-) Thanks again for the really nice plots. What did you make them with? – Z boson Feb 23 '14 at 21:24
  • @Zboson python + matplotlib. BTW I'm seriously impressed that you worked this out! – jtravs Feb 23 '14 at 21:25
  • @jtravs, I used matplotlib in the past. I should use it again. BTW, I think you should add AVX as a tag to this question. You will probably have to remove one of the other ones. – Z boson Feb 23 '14 at 21:29
  • @jtravs, what is the speed of one thread before using zeroupper and then after the fix? After using zeroupper I think the single threaded code should be a lot faster. – Z boson Feb 23 '14 at 21:33
  • @Zboson without it took ~22 s, with it takes ~ 2 s! That is a very good tip to know! – jtravs Feb 23 '14 at 21:41
  • @jtravs, would you mind adding this information to your question? I mean, state that calling VZEROUPPER before each math function fixed the scaling problem. Also state that the single-threaded code went from 22 s to 2 s. Not everyone is going to read all our comments here. – Z boson Feb 23 '14 at 21:52
  • I'm not seeing that you have tried affinity options such as OMP_PLACES=cores to spread 4 threads evenly across cores. If that setting doesn't work (e.g. on an older libgomp) GOMP_CPU_AFFINITY may work. Performance peaking at more threads than cores is typical of libgomp on Windows, where affinity settings are ignored. I'm assuming that you have adopted the vzeroupper fix if you continue to mix SSE and AVX. – tim18 Nov 14 '15 at 19:47
  • @tim18, was this comment supposed to be directed to the OP? – Z boson Nov 14 '15 at 20:55
  • Even gcc 4.8.1 defaults to `-mvzeroupper` with `-march=native` on Haswell. I had to [use `-mno-vzeroupper` to see a change in the asm output](https://godbolt.org/g/6FZEPx). So it looks like this gotcha is basically fixed in newer gcc versions. – Peter Cordes May 26 '16 at 20:14
  • @PeterCordes, I never use `-march=native`. Maybe I should. I think of what I want to support and enable those e.g. `-mavx2 -mfma`. I think `-mvzeroupper` should be the default with `-mavx`. I think it is with Clang. – Z boson May 27 '16 at 09:05
  • @Zboson: `-march=sandybridge` sets `-mtune=sandybridge` as well as enabling `-mavx`, popcnt, crc, and so on. I picked Sandybridge as an example because it has an interesting tune setting: it prefers to do unaligned AVX loads/stores 16B at a time, with `movups xmm` / `vextractf128`. (This leads to sub-optimal code when initializing an array or something: instead of two `movups xmm` stores, it uses `vextractf128` to store the identical upper half. vextract is a longer insn, and more importantly can't micro-fuse. Normally it should be a win if data really is misaligned at runtime.) – Peter Cordes May 27 '16 at 10:52
  • @PeterCordes, so I'm a bit confused now. The OP used GCC 4.9 and `-march=native`. This should enable `-mvzeroupper`, and yet the OP's problem went away only after explicitly using `_mm256_zeroupper()`. Maybe I was wrong in the update I added much later that `-mvzeroupper` would fix the problem. Maybe `_mm256_zeroupper()` is still explicitly needed. – Z boson May 27 '16 at 11:18
  • @Zboson: I'm confused too. I didn't see any ymm instructions in the function that OpenMP starts for each thread. I thought maybe the `-mzeroupper` tune setting was changed between gcc 4.9 minor releases. (e.g. gcc4.9 from 2014 maybe didn't default to enabling it?) – Peter Cordes May 27 '16 at 11:59
  • @PeterCordes, here is my guess: `vzeroupper` is only being called for the master thread and not the other threads (each core, or rather hyper-thread, needs `vzeroupper` I think). I just looked at the assembly and I think that's what I read. `vzeroupper` is called before `GOMP_parallel_start`, so I think it's only for the master thread. This is interesting! If this is correct I would be surprised if Clang gets it right. If you look at my solution, notice that I used `_mm256_zeroupper()` after the parallel region. – Z boson May 27 '16 at 13:02
  • But what happens in the child threads that ever transitions out of State A? There aren't any ymm ops in `main._omp_fn.0`. Mixing AVX-128 and SSE is fine. Can the kernel leave the CPU in state B or C on context switches? – Peter Cordes May 27 '16 at 13:11
  • Is there a way to get OpenMP to run something once in each child thread, outside the part of the parallelized loop? (In the asm output, a single `vzeroupper` outside the inner loop should do the trick. You could of course make this change by hand in the asm output as an experiment.) I predict one `_mm256_zeroupper()` in the `omp parallel` loop body should work (since there's nothing between the function calls for gcc to auto-vectorize). – Peter Cordes May 27 '16 at 13:12
  • @PeterCordes, this is beyond my knowledge now. But my hypothesis seems reasonable. I wonder if it's any better with GCC 5? It would be easy to do a little test. That would at least say if this problem has been fixed or if it's still ongoing in which case I would argue it's in some sense an OpenMP bug. – Z boson May 27 '16 at 13:14
  • @PeterCordes, yes that's easy: `#pragma omp parallel` new line `{ _mm256_zeroupper();` new line `#pragma omp for` new line `for(...) {} }`. That will call `_mm256_zeroupper()` once for each thread rather than for each iteration (a full sketch of this appears after these comments). – Z boson May 27 '16 at 13:16
  • BTW, storing to `B[i]` as a temporary appears sub-optimal with gcc5.3, due to failure of alias analysis or something. Or maybe `_mm256_zeroupper()` accidentally triggers a compiler memory barrier, forcing it to reload pointers afterwards? Anyway, https://godbolt.org/g/5YXl25. Thanks for the OMP syntax, but I'm working on something else ATM. – Peter Cordes May 27 '16 at 13:22
  • @PeterCordes, I finally discovered "Short link" on godbolt. That's useful! Here is what I meant https://godbolt.org/g/uGBgtz. – Z boson May 27 '16 at 14:34
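
Following up on the last comments: below is a minimal sketch (assembled from the comment thread, not part of the original answer) of the per-thread variant, calling _mm256_zeroupper() once per thread instead of once per iteration. The function name compute and its parameters are just for illustration, and whether one call per thread is sufficient depends on the compiler not emitting 256-bit instructions inside the loop body, as discussed above.

#include <immintrin.h>
#include <math.h>

void compute(double *A, double *B, int n) {
    #pragma omp parallel
    {
        // Zero the upper YMM halves once in each worker thread,
        // before any of the (non-AVX) libm calls in the loop below.
        _mm256_zeroupper();
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            B[i] = exp(A[i]) + sin(B[i]);
        }
    }
}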