
I'm learning C++ and was making a realtime raytracer. First I used std::thread to spread the work, but it turned out that starting 32 threads every frame costs far more time than the actual work that needs to be done.

Then I read that C++ also deals with that issue through thread pools, via std::async():

#include <future>

void trace()
{
    constexpr unsigned int thread_count = 32;
    std::future<void> tasks[thread_count];

    for (auto i = 0u; i < thread_count; i++)
        tasks[i] = std::async(std::launch::async, raytrace_task(sample_count, world_), 10, 10);

    for (auto i = 0u; i < thread_count; i++)
        tasks[i].wait();
}

and the raytrace_task is empty:

struct raytrace_task
{
    // simple ctor omitted for brevity

    void operator()(int y_offset, int y_count)
    {
        
    }
};

But this is just as slow as making my own threads. Each call to trace() takes about 30 ms! Can anyone tell me what I'm doing wrong, or how to reuse threads? That is: how to post many data-processing jobs over time to the same reused threads.

thank you

Thomas
    "_starting 32 threads every frame_" - Yeah, that is exactly how to _not_ use threads. – Ted Lyngmo Apr 08 '21 at 20:18
    ditch the async and future. Create the threads once at startup. Have them wait on some semaphore or signal. Each thread would then render the rows of pixels modulo n. – Jeffrey Apr 08 '21 at 20:18
    threads cost time to spin up. Even thread pools need extra time. If your function is fast enough, it's faster to just run it serially than to try and parallelize it. – NathanOliver Apr 08 '21 at 20:18
    It is not guaranteed `std::async` will use thread pools. – Galik Apr 08 '21 at 20:19
    `std::for_each(std::execution::par, tasks.begin(), tasks.end(), [](auto& task) { work with task });` may be an option. I've gotten good performance out of that. – Ted Lyngmo Apr 08 '21 at 20:22
  • @Jeffrey I started doing that when I discovered async. As I come from .NET where async is easy peasy, I figured it would be better than writing some leaky sync-code myself. It's starting to look like I will give that a try anyway. – Thomas Apr 08 '21 at 20:23
  • 1
    @TedLyngmo aha, that sounds promising.. I'll look it up. I prefer not writing threadsync code myself unless I have to :) – Thomas Apr 08 '21 at 20:24
  • @Thomas You'll just have to obey some rules when using it but it'll be clear when you read about it - and make sure that the `task` objects can be accessed without any synchronization (for speed). You can also make the `task` objects "big" (see [`std::hardware_destructive_interference_size`](https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size)) to prevent false sharing. That often crams out a few % more. Oh, and you need to use C++17 or later. – Ted Lyngmo Apr 08 '21 at 20:29
  • `std::async` is *optionally* implemented as a thread pool, but as far as I know the existence of `thread_local` makes it basically impossible for an implementation to actually choose to use a thread pool. `std::async` would need to destroy and reinitialize all of its worker thread's `thread_local` objects for each job. – François Andrieux Apr 08 '21 at 20:30
  • @Thomas, By async in .NET, do you mean its general concurrency utilities or specifically `async` and `await` in C#? The latter is for coroutines, which are a very different beast. – chris Apr 08 '21 at 20:39
  • @chris i meant .NET's TPL in general. creating Tasks that run on pooled threads in the background. – Thomas Apr 08 '21 at 20:44
  • @FrançoisAndrieux thanks. that explains why it's just as slow as making my own threads. I'll look elsewhere :) – Thomas Apr 08 '21 at 20:45
  • It's not what you want to hear, but the reality is that C++ has very poor support for asynchronous programming. You'll need to turn to third-party libraries or craft your own. – GManNickG Apr 08 '21 at 20:47
  • @TedLyngmo thanks. I cannot find the concurrent::par, it should be in header concurrent, but it's not there. I figure I'm not compiling C++17 yet. Will have to investigate. And I read about objects to be big because of the L1 cache, yes. I'm keeping it in mind. regards – Thomas Apr 08 '21 at 20:47
  • The header is `<execution>` and the parameter is `std::execution::par` – Ted Lyngmo Apr 08 '21 at 20:48
  • The `std::async()` stuff is only three years old. You will have to look at your compiler (or standard library implementation) documentation to understand how it is actually implemented or the actual promises that it makes. Though it would be nice if they had created a good thread pool implementation it is not required (yet) by the standard. – Martin York Apr 08 '21 at 20:54
    @TedLyngmo Yes, found it. The content of "execution" header didn't compile because I was compiling C++14, as you said. Fixed! Now, where was I ... :) – Thomas Apr 08 '21 at 20:55
  • Note: Even if a thread pool is used the implementation may lazily create the threads for the pool. So the first time they are used you will still pay the price it is just with extended use you would get the benifit. – Martin York Apr 08 '21 at 20:57
    @MartinYork Makes sense, it's a realtime raytracer, so that would be ok. Frame 1 never wins prizes :) – Thomas Apr 08 '21 at 20:59
    Here somebody actually looked at the implementation of all three major compilers: https://ddanilov.me/std-async-implementations/ (this was in 2020). Short: `clang & g++ both use threads` while `MSVC uses a pool` (as of the time of writing). – Martin York Apr 08 '21 at 21:08
  • @FrançoisAndrieux I am not up on the latest (and I probably misread slightly) just pointing out the article. – Martin York Apr 08 '21 at 21:09
    Is this on Windows? If so, Windows rate-limits the speed at which it fires up new threads, see: https://stackoverflow.com/a/50898570/5743288 – Paul Sanders Apr 08 '21 at 21:21
  • @MartinYork That was an interesting article. When I started using `std::execution::par` I noticed a huge improvement in MSVC but not as much in g++ although they both use (or used, I don't know for sure) TBB as a backend. I don't see that big difference anymore. Perhaps something has changed under the hood. – Ted Lyngmo Apr 09 '21 at 05:58
  • @Thomas I made [a small example](https://godbolt.org/z/8z93fKaWK) of how I often use it. Perhaps you'll find something useful in it. – Ted Lyngmo Apr 09 '21 at 06:01
  • @TedLyngmo Thanks! I dug into the for_each implementation and - as far as I understand it - it looks like it's threadpooling only per-call to for_each, not across multiple calls? Which is unfortunately not a solution for me: it needs to reuse threads across all render-frames. I'll try it out tonight anyway, I may have overlooked something, that level of C++ code is still hard for me to read well. – Thomas Apr 09 '21 at 07:08
  • @Thomas If you've dug that deep you've looked at it more closely than I ever did. Perhaps the differences mentioned in the article that Martin linked to are what you've seen. A few years ago I compared it with my own version of a thread pool on MSVC and there `std::execution::par` won - but when using g++, my own pool won - which may still be the case, if as you say, it's recreating the pool on every `for_each` call. – Ted Lyngmo Apr 09 '21 at 07:16
  • @PaulSanders' [SO answer](https://stackoverflow.com/a/50898570/5743288) is also very interesting. Wow... Good thing I've never used `thread_local` variables when I've used `std::execution::par` in MSVC :-) – Ted Lyngmo Apr 09 '21 at 07:29
    @TedLyngmo Did a quick test in my code using your example and although it stops speeding up at about 10 concurrent tasks, it's enough proof to say it does reuse threads across calls. About it not speeding up beyond 10 tasks: the raytracing code is not written with isolated data-buckets in mind yet, so it's probably just syncing a lot. I'll fix that. – Thomas Apr 09 '21 at 08:00
  • @TedLyngmo Can you create an answer with your example. It solved my problem. Else, I will show what I ended up with and mark solved myself :) And thank you, I learned other things from your example too :) – Thomas Apr 09 '21 at 08:03
  • @Thomas Great! Please go ahead and write an answer. I won't have time right now. :-) – Ted Lyngmo Apr 09 '21 at 08:16

1 Answer


Thanks for all the comments. I ended up incorporating Ted Lyngmo's example which improved performance from 80ms to 7ms per frame using all my cores.

The task struct:

#include <cstddef> // std::max_align_t
#include <new>     // std::hardware_destructive_interference_size (C++17)

#ifdef __cpp_lib_hardware_interference_size
  using std::hardware_constructive_interference_size;
  using std::hardware_destructive_interference_size;
#else
  constexpr std::size_t hardware_constructive_interference_size = 2 * sizeof(std::max_align_t);
  constexpr std::size_t hardware_destructive_interference_size = 2 * sizeof(std::max_align_t);
#endif

struct alignas(hardware_destructive_interference_size) raytrace_task
{
    // ctor omitted

    void operator()()
    {
        // raytrace one screen-chunk here
    }
};

and the code triggering the raytracing each frame:

#include <algorithm>
#include <execution>
#include <thread>
#include <vector>

// ...

void trace()
{
    const auto thread_count = std::thread::hardware_concurrency();

    // generate render-chunks into multiple raytrace_tasks:
    std::vector<raytrace_task> tasks;
    for (auto i = 0u; i < thread_count; i++)
    {
        tasks.push_back(raytrace_task(world_, i, thread_count, camera_, screen_));
    }

    // run the raytrace_tasks:
    std::for_each(std::execution::par, tasks.begin(), tasks.end(), [](auto& task) { task(); });
}

Note: I also had to set Visual Studio to compile as C++17 (project properties > C/C++ > Language).

Thomas
  • 80ms to 7ms - nice! Does `std::execution::seq` work in `_DEBUG` mode? If so, you could have one `#ifdef` globally and do `auto& policy = std::execution::seq;` or `auto& policy = std::execution::par;` and then use `policy` in your `for_each` (and any other algorithms supporting it). – Ted Lyngmo Apr 09 '21 at 09:33
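Ted's single-`#ifdef` suggestion could look like this sketch. It assumes MSVC's `_DEBUG` macro (the comment's context); other compilers would need their own debug switch:

```cpp
#include <execution>

// Select the execution policy once, globally: sequential in debug builds
// (easier to step through), parallel otherwise. _DEBUG is MSVC-specific.
#ifdef _DEBUG
inline constexpr auto policy = std::execution::seq;
#else
inline constexpr auto policy = std::execution::par;
#endif

// Hypothetical usage, matching the answer's for_each call:
// std::for_each(policy, tasks.begin(), tasks.end(), [](auto& task) { task(); });
```

Any other policy-aware algorithm (`std::transform`, `std::sort`, ...) can then take the same `policy` object.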
  • Another thing: You _may_ get even more out of it by not limiting the number of tasks to the number of hardware threads. It's a bit counter intuitive but I've noticed that splitting the full task up in even smaller chunks sometimes helps. I'm not sure how that works (if each individual task has 0 idle time) but it might be worth trying out the double amount of tasks, if nothing else just to rule it out. – Ted Lyngmo Apr 09 '21 at 09:46
    @TedLyngmo oooh... I was wrong, I forgot to set the Debug config to C++17 also, I'll change the code :D (and no, it doesn't improve when splitting up in even more tasks than HW threads, in my case) – Thomas Apr 09 '21 at 10:19