I've written a C++ app that has to process a lot of data. Using OpenMP I parallelized the processing phase quite well and, embarrassingly, found that writing the output is now the bottleneck. I decided to use a parallel for there as well, since the order in which I output items is irrelevant; they just need to come out as coherent chunks.
Below is a simplified version of the output code, showing all the variables except for two custom iterators in the loop that collects data into "related". My question is: is this the correct and optimal way to solve this problem? I read about the barrier pragma; do I need it here?
long n = nrows();
#pragma omp parallel for
for (long i = 0; i < n; i++) {
    // Collect this row's related items (the custom iterators are omitted here).
    std::vector<MyData> related;
    for (size_t j = 0; j < data[i].size(); j++)
        related.push_back(data[i][j]);
    std::sort(related.rbegin(), related.rend());  // descending order

    // Serialize the writes so each row's chunk stays contiguous.
    #pragma omp critical
    {
        std::cout << data[i].label << "\n";
        for (size_t j = 0; j < related.size(); j++)
            std::cout << "  " << related[j].label << "\n";
    }
}
(I tagged this question c as well, since I imagine OpenMP is very similar in C and C++. Please correct me if I'm wrong.)