
I'm new to parallel programming with the TPL and have just finished listening to a TPL course. I have written a small piece of demo software to test my understanding of it.

Let me explain the context of my parallelised code before asking my general questions: (if bored, skip directly to the questions ;-)

Context:

I first wrote a sequential recursive tree-traversal routine (for any two-player perfect-information game) and then tried to parallelise it, exposing as much parallelism as possible by creating concurrent subtasks at each traversed (and evaluated) node (one subtask is launched per child node). The subtrees rooted at each child of the currently evaluated node are themselves evaluated by the same recursive method, called from one of the launched concurrent subtasks.

In the game tree, the nodes represent game positions. The root node is the current game position, and its n child nodes (on the 2nd tree level) are the positions that can be reached by applying the n legal moves available to player A in the root position. The 3rd-level nodes represent the positions reachable by applying the legal moves played by player B on each 2nd-level position/node, and so on until a given maximum search depth is reached.

The tree is traversed depth-first down to a given maximum search depth, where the "leaf" positions are evaluated by a heuristic evaluation function that returns an integer representing the advantage of player A over player B in that position. The greater this value, the better the position is supposed to be for player A.

The best value among all child nodes is reported to their common parent node. (Of course, if the parent node represents a position where player A is to move, the best child node is the one with the highest value; if player B is to move, it is the one with the lowest value.)

The best child values are finally reported up to the root node, together with the index of the 2nd-level child node that reported its best value to the root. Of course, this index represents the "best" move for player A: the move that guarantees player A that the reported value (or a better one) can be reached at the explored depth ("leaf" position), whatever moves player B plays.

OK, this is the well-known minimax algorithm for two-player games.

You may also know the classic optimisation of this sequential algorithm, called alpha-beta pruning, which allows pruning branches/sub-branches of the game tree once we know they cannot lead to interesting leaf positions.

So I tried to parallelise this recursive tree traversal, reporting the leaf values up to the root node, allowing cancellation of the full tree traversal (if the user cancels the search) but also partial cancellation of sub-tree tasks, so that alpha-beta pruning can abort the tasks evaluating some child-node subtrees without interrupting the tasks working on other subtrees.

The work was done successfully: the parallelised search is faster than the sequential one and dispatches the work across all the available cores, which almost always run at 100% (according to the task manager). I don't want to bore you with my source code. I just want to mention that I also applied the .NET 4.5 async/await pattern to my recursive parallelised method, and implemented the "wait-all" pattern (actually await Task.WhenAll()) on the subtasks exploring the subtrees rooted at each child node. (Note: to allow subtree-traversal cancellation, I had to pass a list of cancellation tokens to the recursive async subtree-exploration method, each recursion level adding a new CancellationToken to the list.)
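For readers who want to picture that shape, here is a minimal sketch of the pattern described above. It is not my actual code: `Node`, `Heuristic` and the method name are illustrative assumptions.

```csharp
// Sketch only: one linked CancellationTokenSource per recursion level,
// so a subtree can be cancelled (e.g. for an alpha-beta cutoff) without
// touching its siblings, while the root token still cancels everything.
async Task<int> EvaluateAsync(Node node, int depth, bool maximizing,
                              List<CancellationToken> tokens)
{
    CancellationToken current = tokens[tokens.Count - 1];
    current.ThrowIfCancellationRequested();

    if (depth == 0)
        return Heuristic(node);                      // leaf evaluation

    using (var cts = CancellationTokenSource.CreateLinkedTokenSource(current))
    {
        var childTokens = new List<CancellationToken>(tokens) { cts.Token };

        // "wait-all" pattern: one subtask per child subtree.
        var subtasks = node.Children
            .Select(child => Task.Run(
                () => EvaluateAsync(child, depth - 1, !maximizing, childTokens),
                cts.Token))
            .ToList();

        int[] values = await Task.WhenAll(subtasks);
        return maximizing ? values.Max() : values.Min();
    }
}
```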

Now that you understand the context, here are my questions:

  1. Launching a search will quickly create millions of tasks evaluating millions of subtrees rooted at each tree node, but only a few processor cores are actually available to execute these tasks, with a few threads (one per core?). Is it a good idea to ignore this and create a task for every child node/subtree that has to be evaluated/traversed? Is there no overhead induced by creating more tasks than necessary to expose the maximum parallelism? Wouldn't it be better to limit the number of launched tasks (for example by launching tasks for the highest-level nodes only and using a purely sequential traversal method for the lowest-level ones)? Or can we not care about this and create millions of tasks, even though we know that some of them (those launched on the deepest tree nodes) perform a very small amount of work?

  2. I didn't prevent the "loop index variable captured by closure" issue with a lambda expression argument (as proposed in the course) but by copying the loop index variable into a local variable before launching the task and capturing that copy in the closure, instead of the loop index variable itself. What do you think of this workaround? Is it as safe as passing the value through a lambda expression argument?

  3. The alpha-beta pruning optimisation of the minimax algorithm relies on the sequential traversal of each subtree. Parallel traversal makes alpha-beta pruning less efficient: if all the parallel subtree traversals of the child nodes take approximately the same time to return, no subtree-traversal task will be aborted, and if any are, they will be aborted when they are already almost complete... This severely limits the gain from the alpha-beta optimisation. Do you know smart strategies to address this issue when parallelising the game-tree traversal with alpha-beta optimisation?
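To make question 2's workaround concrete, here are the two variants side by side (the `Explore` method and `children` list are illustrative names, not my real code):

```csharp
var tasks = new List<Task>();

// Variant A (my workaround): copy the loop variable into a fresh local
// on each iteration, so each closure captures its own copy.
for (int i = 0; i < children.Count; i++)
{
    int index = i;                                   // per-iteration copy
    tasks.Add(Task.Run(() => Explore(children[index])));
}

// Variant B (as proposed in the course): pass the value as an explicit
// state argument instead of capturing the loop variable at all.
for (int i = 0; i < children.Count; i++)
{
    tasks.Add(Task.Factory.StartNew(
        state => Explore(children[(int)state]), i));
}
```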

Bug Raptor
  • No, tasks aren't threads. On the other hand, if you want to process a lot of data you should use *Data* (PLINQ), not *Task* parallelism to partition the data and avoid the overhead of scheduling tasks on the ThreadPool. You should also check the SIMD support in .NET 4.6 (the Vector classes) because it offers at least 4x performance increase without overhead from a single core. Instead of working eg on a single float at a time, a single CPU operation works on 4 – Panagiotis Kanavos Nov 10 '15 at 10:43
  • Yes, tasks aren't threads. But of course, tasks are executed on threads, and creating a large number of tasks, even for very small batches of work (the low-level nodes), implies a lot of context changes that might induce overhead. That's why I was wondering if the deepest nodes/subtrees should be traversed by a sequential method. Thanks for the alternate suggestions (PLINQ and SIMD); I'm not sure they are easy to implement in my context but I'll think about it. – Bug Raptor Nov 10 '15 at 12:57
  • 1
    That's why in data parallelism you partition the data then you use a small set of tasks to process it. Task parallelism is used when you have many separate tasks that should run concurrently/asynchronously. In your case though, you have a lot of nodes to process, which makes Data parallelism appropriate. Data parallelism algorithms though have a different structure than sequential ones (except eg in simple loops). It's like changing a FOR loop to a SQL statement, or a MapReduce job. – Panagiotis Kanavos Nov 10 '15 at 13:14
  • So how many tasks are actually "live" (existing) at any one moment? (You can track this with an atomically incremented/decremented counter and a high water mark). – Ira Baxter Nov 11 '15 at 09:56
  • It depends on what you mean by "live" tasks. If I increment the counter just before creating a task and decrement it once the task is completed (after the await statement) (and, when catching the OperationCanceledException raised by cancelled tasks, decrementing the counter once for each still-remaining awaited task), I get millions of simultaneously existing tasks (around the total number of nodes in the traversed tree). – Bug Raptor Nov 11 '15 at 12:28
  • If you mean "live = executing its code" (i.e. if I increment the counter at the beginning of the task's lambda expression and decrement it at the end, using a try/finally statement), I notice the maximum number of simultaneously executing tasks is 8, which logically is the number of available cores on my machine... – Bug Raptor Nov 11 '15 at 12:28
  • @Ira Baxter: Sorry, I was wrong in my previous comments. I just fixed my counting code for simultaneously existing tasks, which was wrong... The result surprised me: actually only a maximum of about 1000 tasks are simultaneously "active" when I traverse a tree with 8 levels and a branching factor of 10. This is far less than I expected... I do not really understand why... – Bug Raptor Nov 11 '15 at 13:20
  • @BugRaptor: If you have done A-B well, the actual effective branching factor is usually sqrt(B) where B is the branching factor in the space. So, sqrt(10) is 3, raised to the 8 ~~ 1000. Parallelism may make this worse; you might speculate down a branch that serial A-B would not have tried. Parallelism may make this better; searching several branches at once may produce a tighter bound in that part of the space, producing an earlier cutoff. To avoid the speculation effect, I'd implement iterative deepening (https://en.wikipedia.org/wiki/Iterative_deepening_depth-first_search). – Ira Baxter Nov 11 '15 at 13:27
  • In fact, if you are going to do search, read *anything* and everything written by Richard Korf. If you are going to do parallel A-B search, read anything and everything by Tony Marsland: http://dblp.uni-trier.de/pers/hd/m/Marsland:T=_Anthony. You should find this interesting: https://chessprogramming.wikispaces.com/Parallel+Search – Ira Baxter Nov 11 '15 at 13:30
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/94810/discussion-between-bug-raptor-and-ira-baxter). – Bug Raptor Nov 11 '15 at 13:43
  • I've had nothing but bad experiences with chat (cutting me off at some artificial threshold). Let us continue it here; it doesn't bother me that SO whines about it. – Ira Baxter Nov 11 '15 at 13:59
  • Another good book for learning parallel programming: "[Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4](https://www.microsoft.com/en-us/download/details.aspx?id=19222)". It's a free e-book from Microsoft; I learned quite a lot from it when I first read it. Also, regarding your previous comment *"Yes tasks aren't threads. But of course, tasks are executed on threads"*: Tasks are a ***promise to complete some work***, and that promise could be fulfilled by a background thread or by an event or IO process. – Scott Chamberlain Nov 11 '15 at 14:21
  • @Ira Baxter: Thank you for your comments. Actually I'm just experimenting with parallel programming in C#/.NET with the TPL and the async/await pattern. I was programming sequential A-B search many years ago when I was a young student (1984!) and just tried to practise parallel programming with the TPL on the subject of game-decision-tree traversal... I just discovered some interesting papers concerning parallel A-B implementations, including Marsland's... I'll read more about this interesting subject. – Bug Raptor Nov 11 '15 at 15:35
  • @Ira Baxter: As a matter of (provisional) conclusion, I understand that, using the TPL, I don't really have to care about the number of Tasks I create; dispatching the thread pool's threads onto these tasks and onto the available cores is the TPL's job. I have no real control over it and can trust it to perform this dispatching reasonably well... Am I right? Although I agree it might be a good thing to decompose the full tree traversal into chunks of work that are not as small as the deepest nodes; I'll try not to create subtasks at the deepest levels to see if the performance improves... – Bug Raptor Nov 11 '15 at 15:35

1 Answer


The question about "how many tasks" is really driven by two things:

  • conceptual program organization
  • task overhead

Assigning one task per node in the tree is conceptually clean and easy. That's the upside.

The downside is that the reason for using tasks is to utilize parallelism effectively, not program organization. You get the best parallelization when you have lots of work for your processors to do, and the overhead of managing that work is small compared to the work itself.

The reason you chose tasks at the node level is that it is easy to see there are (arguably) lots of nodes. [In your comment, you remarked that there were fewer units of work than you expected for a deep search tree!] The other reason is that you don't know exactly how much work a node takes, so having lots of tasks around means a processor can take one, work on it until finished, and then go get another. If they all take some average time, the processors will execute and find other work without wasting a lot of time waiting for work to appear.

The key problem is overhead. To "have" a task, there's a bunch of overhead: something must create it, fill it with work, put it into a work queue where some other processor can get it; that other processor must pull it out of the queue, pick up its content, [not overhead: execute it ], and finally un-create it. Each of these bits of overhead takes some computer instructions, some memory accesses (if you are lucky they are cache hits) and usually some expensive synchronization between processors (you can't have all the CPUs putting new tasks in the available work queue at the same moment).

I don't know what the task overhead is for the C# TPL. You can probably measure it by writing a simple loop that simulates a task body and timing it, and then measuring how long that loop takes when dumped into a task. I do know that you want to minimize the overhead (I designed a parallel programming language, PARLANSE, around that idea) and maximize the work you put into a task.
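For example, a crude way to run that measurement (a sketch, not a rigorous benchmark; `SimulatedNodeWork` is a stand-in for whatever a node actually does):

```csharp
const int N = 100000;

// Time the simulated body inline, with no task machinery.
var sw = Stopwatch.StartNew();
for (int i = 0; i < N; i++)
    SimulatedNodeWork();
sw.Stop();
long inlineMs = sw.ElapsedMilliseconds;

// Time the same body wrapped in one task per call.
sw.Restart();
var tasks = new Task[N];
for (int i = 0; i < N; i++)
    tasks[i] = Task.Run((Action)SimulatedNodeWork);
Task.WaitAll(tasks);
sw.Stop();
long taskedMs = sw.ElapsedMilliseconds;

// The difference, amortized over N, approximates the per-task overhead.
Console.WriteLine("approx. overhead per task: {0} microseconds",
                  (taskedMs - inlineMs) * 1000.0 / N);
```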

In your case, I'd be worried that the work in a node ("generate next move and trivially evaluate it") isn't really very much, and that you would actually lose performance by going parallel. Have you measured how fast your same program runs if you take out the parallelism? (PARLANSE is fast enough that one can code a parallel 8-puzzle solver, yet have the generate-move-evaluate step still be considerably more work than the overhead.)

The standard trick when you don't have enough work in a task to overwhelm the overhead is to put more work into the task.

In your case, you might decide to let a task represent a 2-ply or 3-ply search; that should create plenty of work per task. And for an 8-ply game search, you'll still have plenty of generated tasks to keep the processors busy. I show a similar trick (a coarse-grain threshold) to make parallel Fibonacci work, which is about as awful a work/overhead ratio as one can invent.
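In sketch form, that coarse-grain threshold applied to the tree search might look like this (cutoff value, `Node` and `Heuristic` are illustrative assumptions; tune the cutoff by measurement):

```csharp
// Below this many remaining plies, stay sequential: a task spawned here
// would carry too little work to amortize its scheduling overhead.
const int TaskDepthCutoff = 3;

int Evaluate(Node node, int depth, bool maximizing)
{
    if (depth == 0)
        return Heuristic(node);

    IEnumerable<int> values;
    if (depth <= TaskDepthCutoff)
    {
        // Cheap, deep nodes: plain recursion, zero task overhead.
        values = node.Children
            .Select(c => Evaluate(c, depth - 1, !maximizing))
            .ToList();
    }
    else
    {
        // Expensive, shallow nodes: one task per child subtree,
        // so each task carries several plies of work.
        var tasks = node.Children
            .Select(c => Task.Run(() => Evaluate(c, depth - 1, !maximizing)))
            .ToArray();
        Task.WaitAll(tasks);
        values = tasks.Select(t => t.Result);
    }
    return maximizing ? values.Max() : values.Min();
}
```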

It is well known in the chess world that a sophisticated board evaluation acts as if one had done a 2- or 3-ply search at that board position. So your other choice is to implement an expensive board-evaluation routine; that will make the overhead small in the leaves.

Alpha-beta may be saving you ("only 1000 tasks") from running out of VM due to "big stacks" and many tasks. The good news about generating lots of work is that you have lots to give the CPUs. The bad news is that tasks can take up a lot of space in the VM while they are sitting around not being executed; MS's big-stack model aggravates this. Even if it doesn't run you out of VM, it may cause your program to touch vast amounts of memory, thus negating the value of the processor's cache. I'd worry about this, but I don't have good advice for C#; I built PARLANSE to avoid these problems; it automatically throttles task generation when it has "enough" work. MS has some smart cookies, though.

Ira Baxter
  • Some TPL-specific clarifications: a `Task` is not a thread, it doesn't have a stack (so it doesn't take much memory when not running). A `Task` is basically an abstraction over a thread pool work item. So it should be fairly cheap, but all the overheads you identified are still there. – svick Nov 11 '15 at 17:12
  • What often makes sense when the overhead of `Task` is too large (relative to the amount of work per node) is a producer-consumer approach: have a small constant number of `Task`s (e.g. one per CPU) that process items from a thread-safe queue. For example, [`ActionBlock` from TPL Dataflow](https://msdn.microsoft.com/en-us/library/hh194684) can be easily used like this. – svick Nov 11 '15 at 17:15
  • @svick: so once a task starts running, it runs completely to termination without stopping? If so, it doesn't need stack space. But if one task can get stuck waiting for another (this happens in PARLANSE a lot, climbing over irregular data structures), where does the suspended task store its local state? If lots of tasks can wait, you need lots of space to store that local state. MS Tasks don't have this problem? – Ira Baxter Nov 11 '15 at 17:43
  • Yeah, I was assuming the `Task`s were independent. If they need to wait on each other, you can either do that synchronously (`task.Wait()` or `task.Result`), or asynchronously (`await task`). If you use the synchronous approach, you indeed run into the problem that the thread pool has to keep creating new threads, which means you can run out of memory. – svick Nov 11 '15 at 17:53
  • So how does await-task avoid this? Apparently one task can call many methods, building up a deep return stack, and then do await-task. How can it "await" without recording the call stack? – Ira Baxter Nov 11 '15 at 23:24
  • `await` requires you to turn the method into an `async` `Task`-returning method, so the caller has to use `await` to call it in turn. In effect, the call stack is represented as continuations on the heap. (But I don't think further explanations belong here and there is lots of information about `async`-`await` out there.) – svick Nov 11 '15 at 23:52