
I was wondering whether there is an optimization in gcc that can make single-threaded code like the example below execute in parallel. If not, why not? If so, what kinds of optimizations are possible?

#include <iostream>

int main(int argc, char *argv[])
{
    int array[10];
    for(int i = 0; i < 10; ++i){
        array[i] = 0;
    }
    for(int i = 0; i < 10; ++i){
        array[i] += 2;
    }
    return 0;
}

Added:

Thanks for the OpenMP links; as useful as I think OpenMP is, my question is about compiling the same code without having to rewrite anything. So basically I want to know:

  1. Is making code parallel (at least in some cases) possible without rewriting it?
  2. If yes, which cases can be handled? If not, why not?
Liberus
  • Ummmm... What are you trying to do here? You're incrementing the `array` itself, rather than the values in it. – Sebastian Lenartowicz Oct 17 '16 at 13:48
  • You have to explain to the compiler what you want your program to do. Technically it could make your executable spawn threads as long as the end result were guaranteed to be what you wrote, but good luck getting a compiler to do that for the fun of it ;) – Lightness Races in Orbit Oct 17 '16 at 13:49
  • @SebastianLenartowicz sorry, edited it, now it should be better – Liberus Oct 17 '16 at 13:51
  • @LightnessRacesinOrbit or it could just execute a parallel version of the asm instructions (if there are any) – Liberus Oct 17 '16 at 13:53
  • This may be of use: https://gcc.gnu.org/wiki/openmp – Galik Oct 17 '16 at 13:53
  • @LightnessRacesinOrbit In my case, it's quite obvious what I want to do, and it can be parallelized I believe – Liberus Oct 17 '16 at 14:06
  • @Liberus: But a trivial case like this is not worth the eight decades of research it'd take to implement the feature for general use, nor frankly is it worth parallelising in the first place! OpenMP is probably the closest you can come to what you're asking, as far as I'm aware. – Lightness Races in Orbit Oct 17 '16 at 14:22
  • I was just going to mention that. OpenMP would not be useful for code that is this simple. If the array bound were in the tens of thousands, maybe, but 10 is way too small unless the work done per iteration is much larger. – drescherjm Oct 17 '16 at 14:24
  • @drescherjm sure, 10 is just an example – Liberus Oct 17 '16 at 14:25
  • The reason for that is that the cost of creating threads would be much larger than the amount of time saved. – drescherjm Oct 17 '16 at 14:26
  • @LightnessRacesinOrbit as RobClucas said in his answer, gcc can just decide to use some single-instruction-multiple-data (SIMD) instructions; I don't think that took eight decades of research. But I hear your point – Liberus Oct 17 '16 at 14:28

4 Answers


The compiler can try to automatically parallelise your code, but by default it won't do so by creating threads. It may use vectorised instructions (Intel intrinsics on an Intel CPU, for example) to operate on multiple elements at a time, where it can detect that doing so is possible (for example, when you perform the same operation multiple times on consecutive elements of a correctly aligned data structure). You can help the compiler by telling it which instruction-set extensions your CPU supports (-mavx, -msse4.2, etc.).
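
For example, a loop like the one from the question is a good vectorisation candidate. A sketch (the flag and function name are illustrative):

#include <cstddef>

// Compile with e.g.: g++ -O3 -mavx2 example.cpp
// (-mavx2 is illustrative; use the flag matching your CPU's instruction set)
void add_two(int *array, std::size_t n)
{
    // Same operation on consecutive, independent elements: the compiler
    // can emit SIMD instructions that add 2 to several elements at once.
    for (std::size_t i = 0; i < n; ++i) {
        array[i] += 2;
    }
}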

You can also use these instructions directly, but that requires a non-trivial amount of work from the programmer. There are also libraries which do this already (see the vector class library on Agner Fog's blog).
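
For illustration, a hand-written SSE2 sketch of the same loop (it assumes n is a multiple of 4 to keep the example short):

#include <emmintrin.h>  // SSE2 intrinsics

void add_two_sse2(int *array, int n)
{
    const __m128i twos = _mm_set1_epi32(2);  // four copies of the constant 2
    for (int i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&array[i]));
        v = _mm_add_epi32(v, twos);          // add 2 to four ints at once
        _mm_storeu_si128(reinterpret_cast<__m128i *>(&array[i]), v);
    }
}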

You can also get the compiler to auto-parallelise using multiple threads through OpenMP (OpenMP introduction), although that is more a case of you instructing the compiler to parallelise than of the compiler doing it by itself.
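
For example, the second loop from the question needs only a pragma. A sketch (compile with -fopenmp):

int main()
{
    int array[10] = {};  // zero-initialised, replacing the first loop

    #pragma omp parallel for
    for (int i = 0; i < 10; ++i) {
        array[i] += 2;   // iterations are independent, so threads don't conflict
    }
    return 0;
}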

RobClucas
  • Thanks! That's something. I'll wait for other responses for a while and if you are the only one, I'll mark it. – Liberus Oct 17 '16 at 14:11
  • Note that this is a particular type of parallelism, known as data-parallel. In comparison, threads are considered code-parallel. Alternative terms are SIMD (Single Instruction Multiple Data) vs. MIMD (Multiple Instruction Multiple Data). Ordinary code is just SISD. MISD gives data races. – MSalters Oct 17 '16 at 15:14

Yes: gcc with -ftree-parallelize-loops=4, for example, will attempt to auto-parallelize across 4 threads.
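
A minimal sketch of what that might look like (the file name and loop are illustrative; the work has to be substantial for the threads to pay off):

#include <cstddef>
#include <vector>

// Compile with something like: g++ -O2 -ftree-parallelize-loops=4 example.cpp
int main()
{
    std::vector<double> v(10000000, 1.0);
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] *= 2.0;   // independent iterations gcc may split across threads
    }
    return v.back() > 0.0 ? 0 : 1;   // keep the result observable
}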

I don't know how well gcc does at auto-parallelization, but it is something compiler developers have been working on for years. As other answers point out, giving the compiler some guidance with OpenMP pragmas can give better results, e.g. by letting it know that the order in which something happens doesn't matter, even when reordering may slightly change the result, which is common for floating point (floating-point math is not associative).
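
For example, an OpenMP reduction explicitly grants the compiler the right to reorder the additions. A sketch (assuming -fopenmp):

// Partial sums are computed per thread and combined at the end, so the
// additions happen in a different order than in the serial loop.
double sum(const double *x, long n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}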

Also, applying parallelization only to loops marked with #pragma omp means only the really important loops get this treatment. -ftree-parallelize-loops probably benefits from PGO (profile-guided optimization) so the compiler knows which loops are actually hot and worth parallelizing and/or vectorizing.

This is somewhat related to finding the kind of parallelism that SIMD can take advantage of when auto-vectorizing loops (which is enabled by default at -O3 in gcc and at -O2 in clang).
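
If you want to see what actually got vectorized, both compilers can report it. A sketch (flag spellings to the best of my knowledge):

// gcc:   g++ -O3 -fopt-info-vec example.cpp
// clang: clang++ -O2 -Rpass=loop-vectorize example.cpp
void scale(float *a, int n)
{
    for (int i = 0; i < n; ++i) {
        a[i] *= 2.0f;   // element-wise, independent work: SIMD-friendly
    }
}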

Peter Cordes

Compilers are allowed to do whatever they want as long as the observable behavior (see 1.9 [intro.execution] paragraph 8) is identical to that specified by the [correct(*)] program. Observable behavior is specified in terms of I/O operations (using standard C++ library I/O) and accesses to volatile objects (although the compiler isn't actually required to treat volatile objects specially if it can prove that they aren't in observable memory). To this end the C++ execution system may employ parallel techniques.

Your example program actually has no observable outcome, and compilers are good at constant-folding programs to find out that a program actually does nothing. At best, the heat radiated from the CPU could be an indication of work, but the amount of energy consumed isn't one of the observable effects, i.e., the C++ execution system isn't required to produce it. If you compile the code above with clang with optimization turned on (-O2 or higher), it will entirely remove the loops (use the -S option to have the compiler emit assembly code so you can inspect the result).
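
A sketch of how to keep the loops from being removed: make the result observable (the compiler may still constant-fold the whole computation down to printing 2):

#include <iostream>

int main()
{
    int array[10];
    for (int i = 0; i < 10; ++i) {
        array[i] = 0;
    }
    for (int i = 0; i < 10; ++i) {
        array[i] += 2;
    }
    std::cout << array[5] << '\n';   // I/O is observable behavior
    return 0;
}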

Assuming you actually have loops which are forced to be executed, most contemporary compilers (at least gcc, clang, and icc) will try to vectorize the code to take advantage of SIMD instructions. To do so, the compiler needs to comprehend the operations in the code well enough to prove that parallel execution doesn't change the results or introduce data races (as far as I can tell, the exact results are not necessarily retained when floating-point operations are involved, as some compilers happily parallelize, e.g., loops adding floats even though floating-point addition isn't associative).

I'm not aware of a contemporary compiler which will utilize different threads of execution to improve the speed of execution without some form of hint like OpenMP's pragmas. However, discussions at the committee meetings imply that compiler vendors are at least considering doing so.

(*) The C++ standard imposes no restriction on the C++ execution system in case the program execution results in undefined behavior. Correct programs wouldn't invoke any form of undefined behavior.

tl;dr: compilers are allowed, but not required, to execute code in parallel, and most contemporary compilers do so in some situations.

Dietmar Kühl
  • "However, discussions at the committee meetings imply that compiler vendors are at least considering doing so." Can I read about that somewhere? Seems like an interesting topic to me. – Liberus Oct 18 '16 at 06:38
  • You're correct, `-ffast-math` or OpenMP directives are needed to enable auto-vectorization of things like summing an array of floats (a reduction), because having 4 accumulators as elements of a vector, each seeing every 4th element, would give slightly different results due to different intermediate values. Re: auto-par: gcc does have an auto-parallelization option; I added an answer. I have the impression that it's so hard to do a good job that human decision-making about where to parallelize is usually recommended. – Peter Cordes Nov 01 '16 at 14:26

If you want to parallelize your C++ code, you can use OpenMP. The official documentation can be found here: openmp doc

OpenMP provides pragmas with which you can indicate to the compiler that a portion of code should run on a certain number of threads. You can set the thread count manually, and other pragmas and runtime settings can choose the number of threads automatically.

The code below is an example from the official documentation:

#include <cmath>
int main() {
    const int size = 256;
    double sinTable[size];

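    // Each iteration writes a distinct sinTable element, so iterations are independent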
    #pragma omp parallel for
    for(int n=0; n<size; ++n) {
        sinTable[n] = std::sin(2 * M_PI * n / size);
    }
}

This code will automatically parallelize the for loop, which answers your question. OpenMP offers a lot of other possibilities; you can read the documentation to learn more, and there is one more sketch below.
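
As one more illustration, here is a sketch (again assuming -fopenmp) that fixes the thread count and lets the runtime hand out iterations dynamically:

#include <omp.h>
#include <cstdio>

int main()
{
    #pragma omp parallel for num_threads(4) schedule(dynamic)
    for (int i = 0; i < 16; ++i) {
        // Which thread runs which iteration is decided at run time.
        std::printf("iteration %2d ran on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}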

If you need to understand how to compile with OpenMP support, see this Stack Overflow thread: openmp compilation thread.

Be careful: if you don't use the OpenMP-specific compiler options (-fopenmp for gcc), the pragmas are simply ignored and your code will run on a single thread.

I hope this helps.

matthieusb