4

Answering the question Task.Yield - real usages?, I proposed using Task.Yield to allow a pool thread to be reused by other waiting tasks, in a pattern like this:

    CancellationTokenSource cts;

    void Start()
    {
        cts = new CancellationTokenSource();

        // run the async operation
        var task = Task.Run(() => SomeWork(cts.Token), cts.Token);
        // wait for completion
        // after the completion handle the result / cancellation / errors
    }

    async Task<int> SomeWork(CancellationToken cancellationToken)
    {
        int result = 0;

        bool loopAgain = true;
        while (loopAgain)
        {
            // do something ... meaning substantial work or a micro batch here - not processing a single byte

            loopAgain = /* check for loop end && */ !cancellationToken.IsCancellationRequested;
            if (loopAgain)
            {
                // reschedule the task to the thread pool and free this thread for other waiting tasks
                await Task.Yield();
            }
        }
        cancellationToken.ThrowIfCancellationRequested();
        return result;
    }

    void Cancel()
    {
        // request cancellation
        cts.Cancel();
    }

But one user wrote:

I don't think using Task.Yield to overcome ThreadPool starvation while implementing producer/consumer pattern is a good idea. I suggest you ask a separate question if you want to go into details as to why.

Does anybody know why it is not a good idea?

Maxim T
  • I have no conclusive idea about the original commenter's motivation, but you should try to avoid having a busy loop waiting for data to arrive; instead you should use a mechanism which allows you to trigger the processing. – Lasse V. Karlsen Nov 12 '18 at 13:33
  • @LasseVågsætherKarlsen By the way, I already used this pattern while implementing a workers coordinator for a message bus - https://github.com/BBGONE/REBUS-TaskCoordinator It works fine, but it pulls messages from the queue, whereas the Producer-Consumer pattern pushes messages - as is done in Async Producer/Consumer Queue using Dataflow: https://blog.stephencleary.com/2012/11/async-producerconsumer-queue-using.html – Maxim T Nov 12 '18 at 13:39
  • I'd argue that hot loops are bad *with or without* adding async to the mix - I'd forgive it a lot more if it was `await Task.Delay(50)` or something, but: it would be even better to use an async activation rather than checking in this way; there is the new "channels" API, for example (https://www.nuget.org/packages/System.Threading.Channels/) - which is *designed* for async producer/consumer scenarios (see the sketch after these comments) – Marc Gravell Nov 12 '18 at 13:40
  • @MarcGravell - I think for very short CPU-bound tasks it is OK; for long-running CPU-bound tasks it is better to use a custom TaskScheduler to run the tasks on dedicated threads. I already tested it here: https://github.com/BBGONE/TaskCoordinator/blob/master/TaskCoordinatorTest/TestLibrary/TestMessageDispatcher.cs – Maxim T Nov 12 '18 at 13:44
  • @MarcGravell The channels look good to me. I will research more info about them. Looks like performance is exceptional. https://www.reddit.com/r/dotnet/comments/8b4jq3/dataflow_vs_channels_evolution_of_asyncfilewriter/ – Maxim T Nov 12 '18 at 16:18
  • @MaximT indeed - it is what I'm using for ordered message queues in SE.Redis: https://github.com/StackExchange/StackExchange.Redis/blob/master/src/StackExchange.Redis/ChannelMessageQueue.cs#L57 – Marc Gravell Nov 12 '18 at 16:20
  • It is the *exact* opposite of what the Threadpool manager tries to do. It makes an effort to limit the number of active tp threads to the ideal number in order to cut down on the context switching overhead. When you use Task.Yield then you add context switching overhead. If you have too many tp threads that don't execute code efficiently (blocking too much) then use SetMinThreads(). – Hans Passant Nov 12 '18 at 16:22
  • @HansPassant The worst performance degradation was when I increased the min threads number using SetMinThreads. The CPU usage became 100% and performance dropped sharply. Thanks for the advice, anyway! – Maxim T Nov 13 '18 at 13:17
  • @MarcGravell I added an answer to the question and included the tests in it. It seems the performance impact of Task.Yield is a bit exaggerated. With only one Task.Yield the performance with an ultra-short task dropped 15%. The biggest drop was only when two Task.Yields were added - the drop was about 90%. With bigger tasks (not so short) the performance drop is negligible. Anyway, with Task.Yield and 6 threads it processes 476,000 messages per second (without it, 570,000). – Maxim T Nov 13 '18 at 13:26
  • Some programmers think that 50% cpu usage is better. That's a very mystifying idea, they could have saved a lot of money on the machine they bought. Use a concurrency analyzer to find out what is *really* going on, VS has a [slick one available](https://marketplace.visualstudio.com/items?itemName=Diagnostics.ConcurrencyVisualizerforVisualStudio2015). – Hans Passant Nov 13 '18 at 13:27
  • @HansPassant The problem starts when I set min threads above the number of processors in the system; they start context switching at the OS level. The ThreadPool (by default) does not process a lot of tasks in parallel even if I start 1000 of them. But when I set the min threads number above the number of processors in the OS, it starts to execute more tasks - it oversaturates the processors. – Maxim T Nov 13 '18 at 13:37
  • Well, of course that's the way it must work. No amount of affordable money is going to buy you a machine with a thousand processor cores. You can't slam the threadpool with that many jobs to do and expect instant magic. These are important details that belong in the question btw. – Hans Passant Nov 13 '18 at 13:44
  • @MarcGravell I added a test of Threading.Channels vs BlockingCollection performance in the producer-consumer pattern: https://github.com/BBGONE/TestThreadAffinity The performance is almost the same. They were probably meant not for raw performance, but to avoid blocking threadpool threads while waiting for messages. – Maxim T Nov 14 '18 at 09:10
  • @MarcGravell At first I modeled the Channels like you did in the StackExchange example - unbounded. It looks like they perform better if they are bounded and the producer pumps messages to the writer while the reader reads them. – Maxim T Nov 14 '18 at 12:45
  • @MarcGravell I ported the Threading.Channels test to CoreFX (instead of the full .NET Framework) - it started to work 2.5 times faster. Now it is above 1 million messages per second on my machine. I added this solution to the test. They are really good. – Maxim T Nov 17 '18 at 18:17
  • As a side-note, the `cancellationToken.ThrowIfCancellationRequested();` at the end of the `SomeWork` method is against the recommended cancellation patterns. It is possible that the `cancellationToken` is canceled at the moment your work is about to complete, and the work actually completes successfully before you have a chance to observe the cancellation of the token. In this case you shouldn't propagate an `OperationCanceledException`, because the operation was not actually canceled. – Theodor Zoulias Feb 16 '23 at 12:50
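A minimal sketch of the channels-based approach suggested in the comments above (illustrative code, not taken from any of the linked repositories; the bounded capacity of 100 and the int message type are arbitrary assumptions):

    using System.Threading.Channels;
    using System.Threading.Tasks;

    class ChannelsSketch
    {
        static async Task Main()
        {
            // A bounded channel applies back-pressure to the producer instead of letting it spin.
            var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(100)
            {
                SingleReader = true,
                SingleWriter = true
            });

            var consumer = Task.Run(async () =>
            {
                // The reader awaits new items instead of polling in a loop with Task.Yield.
                while (await channel.Reader.WaitToReadAsync())
                {
                    while (channel.Reader.TryRead(out int item))
                    {
                        // handle the message (substantial work or a micro batch)
                    }
                }
            });

            for (int i = 0; i < 1000; i++)
            {
                // WriteAsync waits asynchronously (without blocking a thread) when the channel is full.
                await channel.Writer.WriteAsync(i);
            }
            channel.Writer.Complete();
            await consumer;
        }
    }

Here no thread is blocked or busy-looping: the consumer is only scheduled when there is something to read, which is what the comments recommend instead of rescheduling with Task.Yield.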

2 Answers

5

There are some good points left in the comments to your question. Being the user you quoted, I'd just like to sum it up: use the right tool for the job.

Using the ThreadPool doesn't feel like the right tool for executing multiple continuous CPU-bound tasks, even if you try to organize some cooperative execution by turning them into state machines which yield CPU time to each other with await Task.Yield(). Thread switching is rather expensive; by doing await Task.Yield() in a tight loop you add a significant overhead. Besides, you should never take over the whole ThreadPool, as the .NET framework (and the underlying OS process) may need it for other things. On a related note, TPL even has the TaskCreationOptions.LongRunning option that requests not to run the task on a ThreadPool thread (rather, it creates a normal thread with new Thread() behind the scenes).
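For illustration, a minimal sketch of that option (the body of the delegate is just a placeholder for continuous CPU-bound work):

    // TaskCreationOptions.LongRunning hints the default scheduler to give the task
    // a dedicated thread instead of occupying a ThreadPool thread for its whole lifetime.
    Task longRunning = Task.Factory.StartNew(
        () => { /* long continuous CPU-bound work goes here */ },
        CancellationToken.None,
        TaskCreationOptions.LongRunning,
        TaskScheduler.Default);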

That said, using a custom TaskScheduler with limited parallelism on some dedicated, out-of-pool threads with thread affinity for individual long-running tasks might be a different thing. At least, await continuations would be posted on the same thread, which should help reduce the switching overhead. This reminds me of a different problem I was trying to solve a while ago with ThreadAffinityTaskScheduler.
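To make the idea concrete, here is a deliberately minimal, illustrative sketch of a scheduler that runs everything queued to it on a single dedicated (non-ThreadPool) thread; it is not the ThreadAffinityTaskScheduler linked above, and the type name is made up:

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;

    sealed class SingleThreadTaskScheduler : TaskScheduler, IDisposable
    {
        private readonly BlockingCollection<Task> _queue = new BlockingCollection<Task>();
        private readonly Thread _thread;

        public SingleThreadTaskScheduler()
        {
            _thread = new Thread(() =>
            {
                // Consume and execute queued tasks on this one dedicated thread.
                foreach (var task in _queue.GetConsumingEnumerable())
                    TryExecuteTask(task);
            }) { IsBackground = true };
            _thread.Start();
        }

        protected override void QueueTask(Task task) => _queue.Add(task);

        // Inlining would run the task on the caller's thread, defeating the affinity.
        protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) => false;

        protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

        public void Dispose() => _queue.CompleteAdding();
    }

A long-running task can then be started with Task.Factory.StartNew(work, CancellationToken.None, TaskCreationOptions.DenyChildAttach, new SingleThreadTaskScheduler()), so it never occupies a ThreadPool thread.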

Still, depending on a particular scenario, it's usually better to use an existing well-established and tested tool. To name a few: Parallel Class, TPL Dataflow, System.Threading.Channels, Reactive Extensions.
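For example, assuming a hypothetical ProcessMessage handler and an arbitrary bounded capacity, the same producer/consumer workload can be expressed with TPL Dataflow's ActionBlock (a sketch, not code from the links above):

    using System;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    class DataflowSketch
    {
        static async Task Main()
        {
            var worker = new ActionBlock<int>(
                message => ProcessMessage(message),       // hypothetical handler
                new ExecutionDataflowBlockOptions
                {
                    MaxDegreeOfParallelism = Environment.ProcessorCount,
                    BoundedCapacity = 1000                // back-pressure instead of a busy loop
                });

            for (int i = 0; i < 1_000_000; i++)
                await worker.SendAsync(i);                // awaits (doesn't block) when the block is full

            worker.Complete();
            await worker.Completion;
        }

        static void ProcessMessage(int message) { /* substantial work or a micro batch */ }
    }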

There is also a whole range of existing industrial-strength solutions to deal with the Publish-Subscribe pattern (RabbitMQ, PubNub, Redis, Azure Service Bus, Firebase Cloud Messaging (FCM), Amazon Simple Queue Service (SQS), etc.).

noseratio
  • I know about all the well-established solutions - I used some of them myself. In terms of performance, Kafka is the best in this range, and NATS as well. But the performance gains usually come at the expense of reliability. For reliable message processing the messages need to be read from a durable store rather than buffered in memory. And the tasks are usually not as simple as processing a single byte, but usually take some milliseconds. I usually use WorkStealingTaskScheduler for long CPU-bound tasks (it has a pool of custom threads). So it all depends on the context where it is used. – Maxim T Nov 13 '18 at 03:49
  • The key point in your answer - don't use Task.Yield in a tight loop. I agree 100%. But if the time taken to process each iteration exceeds the cost of an additional ThreadPool.QueueUserWorkItem, then the performance decrease is negligible, while responsiveness and task cooperation increase. By the way, it is easy to test with a small custom setup. In a tight loop the decrease is around 100%, but if some work (around several milliseconds) is done in each iteration, then the decrease is less than 10%. – Maxim T Nov 13 '18 at 03:53
  • @MaximT, in any case I wouldn't overload the default `ThreadPool` with a million small computational tasks. But let's say you have a custom pool, e.g. you created one with `WorkStealingTaskScheduler`. It would be interesting to see the actual benchmarks. E.g., having 10000 tasks each calculating the first 10000 digits of Pi. Then compare it to `ThreadPoolTaskScheduler` (make sure to fix the number of threads with `SetMinThreads/SetMaxThreads`). Then compare it to a task scheduler with actual thread affinity (AFAIR, `WorkStealingTaskScheduler` isn't affine for `await` continuations). – noseratio Nov 13 '18 at 04:20
  • In my experience the performance killer is not context switching itself, but very frequent context switching. If the job is very short, like calculating digits of Pi, then batching 10000 iterations into one ThreadPool.QueueUserWorkItem will solve the problem. The WorkStealingTaskScheduler is used for longer CPU-bound synchronous tasks without async/await in them. Look at https://github.com/BBGONE/REBUS-TaskCoordinator/blob/master/Showdown/Rebus.Transports.Showdown.Core/MessageReceiver.cs to see how a mixed job is split into subtasks (the HandleLongRunMessage method). – Maxim T Nov 13 '18 at 05:04
  • [The price of a context switch] = [context switch duration] / ([job duration] + [context switch duration]). The shorter the job, the higher the price. For very long jobs the thread pool is not a solution anyway. The drawback of the ThreadAffinityTaskScheduler is that it is not portable to .NET Core - it is platform dependent. And this problem could be solved with micro-batching - the batch will be processed on a single thread. – Maxim T Nov 13 '18 at 07:10
  • @MaximT I'm not suggesting to use my version of ThreadAffinityTaskScheduler; you can easily create your own. Also, we are talking here about using Task.Yield on every iteration of the inner loop (aren't we?), so each scheduled task is as short as that iteration is. – noseratio Nov 13 '18 at 07:29
  • Yes, we are talking about a single inner loop - the question is the size of the job inside that loop: it could be processing a single byte, or a batch. The solution is not to use very small job payloads. The // do something ... inside the loop means a substantial task, not calculating a single digit of Pi. I meant there was substantial work there, and you meant it was a tight loop. We misunderstood each other. – Maxim T Nov 13 '18 at 07:37
  • By the way, I tried to use the ThreadAffinityTaskScheduler in my testing lab - https://github.com/BBGONE/TaskCoordinator With an ultra-short CPU-bound task the time taken to execute a 500,000-message batch is exactly the same. The problem is testing a task with continuations, because a very short truly async task is needed (and Task.Delay is too coarse for the test), so I cannot compare in that case. – Maxim T Nov 13 '18 at 08:54
  • I posted the tests to GitHub: https://github.com/BBGONE/TestThreadAffinity Instead of a very short async task I used Task.Yield. The code with the thread-affinity scheduler performed a bit better: 4437 ms vs 4741 ms, a 7% improvement. – Maxim T Nov 13 '18 at 09:30
  • I added better tests - removed the serialization overhead. The performance with the ultra-short task is around 460,000 messages per second. Without Task.Yield it is around 546,000 per second. The ThreadAffinityTaskScheduler processes only 370,000 messages per second (without Task.Yield). So, as I said, it is better to test than to speculate on theories. https://github.com/BBGONE/TestThreadAffinity – Maxim T Nov 13 '18 at 12:30
  • Oops, the ThreadAffinityTaskScheduler processes 576,000 messages per second (without Task.Yield). But with an async part the performance drops to 276,000 messages per second - the same as with the default task scheduler. With larger tasks the performance gain will be negligible. – Maxim T Nov 13 '18 at 12:41
  • I ported the Threading.Channels test to CoreFX (instead of the full .NET Framework) - it started to work 2.5 times faster. Now it is above 1 million messages per second on my machine. I added this solution to the test. They are really good. – Maxim T Nov 17 '18 at 18:18
  • @MaximT, that's good to know. `Threading.Channels` seems to be the right tool for the job. – noseratio Nov 17 '18 at 21:14
  • @TheodorZoulias it was a good point anyway :) Where did you move it to, if I may ask? – noseratio Feb 16 '23 at 19:37
0

After a bit of debating the issue with other users - who are worried about context switching and its influence on performance - I see what they are worried about.

But I meant the // do something ... inside the loop to be a substantial task - usually in the form of a message handler which reads a message from the queue and processes it. The message handlers are usually user defined and the message bus executes them using some sort of dispatcher. The user can implement a handler which executes synchronously (nobody knows what the user will do), and without Task.Yield that handler would block the thread while processing those synchronous tasks in a loop.

Not to be empty-worded, I added tests to GitHub: https://github.com/BBGONE/TestThreadAffinity They compare the ThreadAffinityTaskScheduler, the .NET ThreadScheduler with a BlockingCollection, and the .NET ThreadScheduler with Threading.Channels.

The tests show that for ultra-short jobs the performance degradation is around 15%. To use Task.Yield without performance degradation (even a small one), don't use extremely short tasks; if the task is too short, combine shorter tasks into a bigger batch (see the sketch below).

[The price of context switch] = [context switch duration] / ([job duration]+[context switch duration]).

In that case the influence of task switching on the performance is negligible, but it adds better task cooperation and responsiveness to the system.
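Here is a sketch of that micro-batching idea applied to the loop from the question (the batch size of 100 is an arbitrary illustrative value; cancellationToken is assumed to come from the surrounding method, as in SomeWork above):

    const int BatchSize = 100;    // illustrative value; tune it so a batch takes at least a few milliseconds
    int doneInBatch = 0;

    bool loopAgain = true;
    while (loopAgain)
    {
        // do a unit of substantial work here ...

        loopAgain = /* check for loop end && */ !cancellationToken.IsCancellationRequested;
        if (loopAgain && ++doneInBatch >= BatchSize)
        {
            doneInBatch = 0;
            // reschedule to the thread pool only once per batch
            await Task.Yield();
        }
    }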

For long-running tasks it is better to use a custom scheduler which executes tasks on its own dedicated thread pool (like the WorkStealingTaskScheduler).

For mixed jobs - which can contain different parts: short-running CPU-bound, asynchronous and long-running code parts - it is better to split the task into subtasks.

    private async Task HandleLongRunMessage(TestMessage message, CancellationToken token = default(CancellationToken))
    {
        // SHORT SYNCHRONOUS TASK - execute as is on the default thread (from the thread pool)
        CPU_TASK(message, 50);
        // IO-BOUND ASYNCHRONOUS TASK - used as is
        await Task.Delay(50);
        // BUT WRAP THE LONG SYNCHRONOUS TASK inside a Task
        // which is scheduled on the custom thread pool
        // (to save threadpool threads)
        await Task.Factory.StartNew(() => {
            CPU_TASK(message, 100000);
        }, token, TaskCreationOptions.DenyChildAttach, _workStealingTaskScheduler);
    }
Maxim T
  • The info on the channels: https://github.com/stephentoub/corefxlab/blob/master/src/System.Threading.Tasks.Channels/README.md – Maxim T Nov 14 '18 at 12:48
  • It seems I figured it out. Although the performance is the same in all usages, the channels have the benefit of a non-blocking wait for when the channel can be written to. This happens in the bounded scenarios. The BlockingCollection blocks the thread, not allowing it to write, while the Channels leave the thread free to be used by others. – Maxim T Nov 17 '18 at 12:54
  • I ported the Threading.Channels test to CoreFX (instead of the full .NET Framework) - it started to work 2.5 times faster. Now it is above 1 million messages per second on my machine. I added this solution to the test. They are really good. – Maxim T Nov 17 '18 at 18:15