
I wrote an update() function that runs on a single thread, then I wrote the function updateMP() below, which does the same thing except that it divides the work of the two for loops among several threads:

void GameOfLife::updateMP()
{
    std::vector<Cell> toDie;
    std::vector<Cell> toLive;

#pragma omp parallel
    {
        // private, per-thread variables
        std::vector<Cell> myToDie;
        std::vector<Cell> myToLive;

#pragma omp for
        for (int i = 0; i < aliveCells.size(); i++) {
            auto it = aliveCells.begin();
            std::advance(it, i);

            int liveCount = aliveCellNeighbors[*it];
            if (liveCount < 2 || liveCount > 3) {
                myToDie.push_back(*it);
            }
        }

#pragma omp for
        for (int i = 0; i < aliveCellNeighbors.size(); i++) {
            auto it = aliveCellNeighbors.begin();
            std::advance(it, i);

            if (aliveCells.find(it->first) != aliveCells.end()) // is this cell alive?
                continue; // if so skip because we already updated aliveCells

            if (aliveCellNeighbors[it->first] == 3) {
                myToLive.push_back(it->first);
            }
        }

#pragma omp critical
        {
            toDie.insert(toDie.end(), myToDie.begin(), myToDie.end());
            toLive.insert(toLive.end(), myToLive.begin(), myToLive.end());
        }
    }

    for (const Cell& deadCell : toDie) {
        setDead(deadCell);
    }

    for (const Cell& liveCell : toLive) {
        setAlive(liveCell);
    }
}

I noticed that it performs worse than the single-threaded update() and seems to be getting slower over time.

Am I doing something wrong by using omp for twice? I am new to OpenMP, so I am still figuring out how to use it.

Why am I getting worse performance with my multithreaded implementation?

EDIT: Full source here: https://github.com/k-vekos/GameOfLife/tree/hashing?files=1

Kyle V.
  • Are you compiling with optimizations and what are you using to time it? – William Miller Feb 27 '19 at 21:34
  • Often this is because the algorithm didn't parallelize smoothly. The multiple threads fight it out over a common resource and waste more time than they save. Another common one is that the overhead of threading exceeds the runtime of the algorithm. – user4581301 Feb 27 '19 at 21:39
  • @WilliamMiller Yes, I'm using the Release build config; here is the command line: `/GS /GL /analyze- /W3 /Gy /Zc:wchar_t /I"C:\SFML-2.5.1\include" /Zi /Gm- /O2 /sdl /Fd"Release\vc141.pdb" /Zc:inline /fp:precise /D "SFML_STATIC" /D "_MBCS" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /openmp /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\GameOfLife.pch" /diagnostics:classic` I am timing it by eye; the difference is that noticeable. – Kyle V. Feb 27 '19 at 21:39
  • How long does this part of the operation take to run for a single iteration? It is entirely possible that the creation of the threads dominates the runtime, as @user4581301 is alluding to. – William Miller Feb 27 '19 at 21:41
  • I don't know OpenMP, but this doesn't look like an [mcve] to me. – xaxxon Feb 27 '19 at 22:08
  • Each step of Conway's Game of Life is fast, so you're probably spending more time synchronizing than executing code. – Mooing Duck Feb 28 '19 at 00:41
  • @xaxxon added full source in the OP as a link – Kyle V. Feb 28 '19 at 03:23
  • @KyleV. the full [mcve] needs to be in the question. links are not acceptable for the fundamental contents of the question. Links can go away, but the post needs to continue to be useful for people coming to the question later for similar issues. – xaxxon Feb 28 '19 at 07:11
  • What is the serial version you compare it to? Does it use the same code without enabling OpenMP or is the code different? `std::advance` on the iterator of an `std::unordered_set` in a loop is very bad. It has linear complexity because it is not a `RandomAccessIterator` - thus the loop has quadratic complexity versus a serial implementation that - presumably - does not use `std::advance`. (See the sketch after these comments.) – Zulan Feb 28 '19 at 07:46
  • You parallelize every iteration, when every iteration is just one pixel, meaning that it just checks surrounding neighbours and decides whether to turn the pixel on or off. That's nothing. You most likely spend more time switching between threads than actually solving the problem. Consider parallelizing over lines, or better yet, bigger 2D chunks of your grid (for example 8 lines for a thread), that will result in less thread switches and better throughput. – Purple Ice Mar 01 '19 at 10:11
  • The compiler you have chosen will analyze the sequential version of the code to determine whether the loops may be fused (combined into a single loop). The compiler optimization reports will show this. If this happens, combining the loops will be even more important with parallelization, but the parallelization directives prevent it. – tim18 Mar 01 '19 at 11:32
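
A minimal sketch of the indexed-access alternative from the `std::advance` comment above, assuming `aliveCells` is a `std::unordered_set<Cell>` and `aliveCellNeighbors` is a `std::unordered_map<Cell, int>` as the comments suggest; the `aliveSnapshot` vector is introduced here only for illustration and would be built once per update, before the parallel region:

std::vector<Cell> aliveSnapshot(aliveCells.begin(), aliveCells.end()); // built before #pragma omp parallel

#pragma omp for
    for (int i = 0; i < (int)aliveSnapshot.size(); i++) {
        const Cell& c = aliveSnapshot[i];          // O(1) indexed access instead of std::advance (O(i) per iteration)

        // const lookup instead of operator[], which could insert and race across threads
        auto found = aliveCellNeighbors.find(c);
        int liveCount = (found != aliveCellNeighbors.end()) ? found->second : 0;

        if (liveCount < 2 || liveCount > 3) {
            myToDie.push_back(c);
        }
    }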

2 Answers


Why am I getting worse performance with my multithreaded implementation?

Classic question :)

You loop through only the alive cells. That's actually pretty interesting. A naive implementation of Conway's Game of Life would look at every cell. Your version optimizes for the case where alive cells are outnumbered by dead ones, which I think is common later in the game. I can't tell from your excerpt, but I assume the trade-off is possibly doing redundant work when the ratio of alive to dead cells is higher.

A caveat of omp parallel is that there's no guarantee that the threads won't be created/destroyed during entry/exit of the parallel section. It's implementation-dependent. I can't seem to find any information on MSVC's implementation. If anyone knows, please weigh in.

So that means that your threads could be created/destroyed every update loop, which is heavy overhead. For this to be worth it, the amount of work should be orders of magnitude more expensive than the overhead.

You can profile/measure the code to determine overhead and work time. It should also help you see where the real bottlenecks are.

Visual Studio has a profiler with a nice GUI. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.

Use high_resolution_clock to time sections that are hard to measure with the profiler.
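
For instance, a minimal sketch (the `game` object name is just a placeholder):

#include <chrono>
#include <iostream>

// Time a single call to updateMP() so it can be compared against update().
auto start = std::chrono::high_resolution_clock::now();
game.updateMP(); // 'game' is a placeholder for your GameOfLife instance
auto stop = std::chrono::high_resolution_clock::now();
auto micros = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
std::cout << "updateMP took " << micros << " us\n";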

If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload). Most of the standard algorithms have one. https://en.cppreference.com/w/cpp/algorithm/for_each. They're so new that I barely know anything about them (they might also have the same issues as OpenMP).
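
An untested sketch of what that could look like here, assuming the container types from the question; the set is copied into a vector first so the algorithm gets random-access iterators, and a mutex guards the shared result vector:

#include <algorithm>
#include <execution>
#include <mutex>
#include <vector>

// Rough sketch, untested -- assumes Cell, aliveCells and aliveCellNeighbors as in the question.
std::vector<Cell> snapshot(aliveCells.begin(), aliveCells.end());
std::vector<Cell> toDie;
std::mutex m;

std::for_each(std::execution::par, snapshot.begin(), snapshot.end(),
    [&](const Cell& c) {
        auto found = aliveCellNeighbors.find(c);           // read-only lookup is safe concurrently
        int liveCount = (found != aliveCellNeighbors.end()) ? found->second : 0;
        if (liveCount < 2 || liveCount > 3) {
            std::lock_guard<std::mutex> lock(m);           // shared vector needs synchronization
            toDie.push_back(c);
        }
    });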

seems like it's getting slower over time.

Could it be that you're not cleaning up one of your vectors?

Humphrey Winnebago
  • Whether the implementation-specific speculations are correct or not, this is the real answer. Even if threads get reused, Game of Life is so simple that switching between threads might take more time than checking a single cell. – Purple Ice Mar 01 '19 at 10:17

First off, if you want any kind of performance, you must do as little work as possible in a critical section. I'd start by changing the following:

std::vector<Cell> toDie;
std::vector<Cell> toLive;

to

std::vector<std::vector<Cell>> toDie;
std::vector<std::vector<Cell>> toLive;

Then, in your critical section, you can just do:

toDie.push_back(std::move(myToDie));
toLive.push_back(std::move(myToLive));

Arguably, a vector of vectors isn't pretty, but it prevents the deep copies inside the critical section, which are unnecessary time spent while other threads wait.
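
Put together, the parallel region might look roughly like this (sketch only, with the two loops elided):

std::vector<std::vector<Cell>> toDie;
std::vector<std::vector<Cell>> toLive;

#pragma omp parallel
    {
        std::vector<Cell> myToDie;
        std::vector<Cell> myToLive;

        // ... the two '#pragma omp for' loops filling myToDie / myToLive ...

#pragma omp critical
        {
            // std::move transfers each thread's buffer; no per-element copies here
            toDie.push_back(std::move(myToDie));
            toLive.push_back(std::move(myToLive));
        }
    }

    // Flatten afterwards, outside the parallel region.
    for (const auto& chunk : toDie)
        for (const Cell& deadCell : chunk)
            setDead(deadCell);

    for (const auto& chunk : toLive)
        for (const Cell& liveCell : chunk)
            setAlive(liveCell);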

[Update] IMHO there's no point in using multithreading with non-contiguous data structures, at least not in that way. The fact is you'll spend most of your time waiting on cache misses, because that's what associative containers give you, and little time doing the actual work. I don't know how this game works, but if I had to do something with numerous updates plus rendering, I would do the updates as quickly as possible on the 'main' thread and run the renderer on another (detached) thread. You could then hand the renderer the results after every "update" and perform another update while it's rendering.
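
A very rough sketch of that update/render split (untested; renderLoop, updateLoop, latestSnapshot, getAliveCells and the drawing call are all hypothetical names):

#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::mutex snapshotMutex;
std::vector<Cell> latestSnapshot;   // last published set of alive cells
std::atomic<bool> running{ true };

void renderLoop()
{
    while (running) {
        std::vector<Cell> frame;
        {
            std::lock_guard<std::mutex> lock(snapshotMutex);
            frame = latestSnapshot;          // copying is cheap compared to drawing
        }
        // draw(frame);                      // hypothetical rendering call
    }
}

void updateLoop(GameOfLife& game)
{
    while (running) {
        game.update();                       // or updateMP()
        const auto& cells = game.getAliveCells();   // hypothetical accessor to the alive-cell set
        std::lock_guard<std::mutex> lock(snapshotMutex);
        latestSnapshot.assign(cells.begin(), cells.end());
    }
}

// Usage: std::thread(renderLoop).detach(); then run updateLoop on the main thread.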

Also, I'm definitely not an expert in hashing, but `hash<int>()(k.x * 3 + k.y * 5)` seems like a high-collision hash. You can certainly try something else, like what's proposed here.
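
For example, a boost-style hash_combine is one common alternative (sketch only, assuming Cell exposes integer members x and y as the expression above suggests):

#include <cstddef>
#include <functional>

struct CellHash
{
    std::size_t operator()(const Cell& k) const
    {
        std::size_t seed = std::hash<int>()(k.x);
        // Constant and mixing borrowed from boost::hash_combine.
        seed ^= std::hash<int>()(k.y) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

// Usage (hypothetical): std::unordered_set<Cell, CellHash> aliveCells;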

AlexG
  • "...not be optimal if you define an iterator inside the loop" Is it possible to iterate an `unordered_map` using a for loop with an index? I thought I could only use the `[]` operator by passing in an object of the templated type. Also, sorry, added a link to full source code in my OP. – Kyle V. Feb 28 '19 at 03:25