I'm new to parallel programming with TPL and just finished listening a TPL course. I have just written a small piece of demo software to test my understanding of it.
Let me explain the context of my prallelised code before asking my general questions: (if bored, skip directly to the questions ;-)
Context:
I first wrote a sequential recursive tree traversing recursive code (for any two-players state-known game) and tried then to parrallelise it, exposing as much parallelism as possible, by creating concurrent subtasks on each traversed (and evaluated) node. (One subtask is launched for each child node). The subtrees issuing from each child-node of the currently evaluated head node beeing themselves evaluated by the same recursive method called by one of the launched concurrent subtasks.
In the game-tree, the nodes represent game positions. The root node beeing the current game position and its n child nodes (on 2nd tree level) the game positions that can be reached by applying the n legal moves for the player A in the root position. The 3rd level nodes represent the reachable positions by applying the legal moves played by player B on each 2nd level positions/nodes, and so on until a given max search depth is reached..
The tree is traversed in a depth-first traversal order until a given max search depth where the "leaf" positions will be evaluated by an heuristic evaluation function that returns an integer value representing the advantage of player A vs player B on this game position. The greater this value is, the better the position is supposed to be for the player A.
The best value of all child nodes is reported to their common parent node, (Of course if the parent node represents a position played by player A, the best child node is the child-node with the highest value, and if it is played by player B, the best child node is the one with the lowest value).
The best child values are finally reported to the root node with the index of the 2nd level child node that reported its best value to the root node. Of course, this index represents the "best" move for player A. That means the move that guarantee to the player A that the reported value (or a better one) can be reached at the explored depth ("leaf" position) whatever the moves played by the player B.
Ok, this is the well known minimax algorithm for two-players state-games.
You maybe also know the classical optimisation of this sequential algorithm, called alpha-beta, allowing to prune some branches/sub-branches of the game tree when we know they are not leading to interesting leaf positions.
So I tried to parallelise this recursive tree traversal reporting the leaf values to the root node and allowing complete full tree traversal cancellation (if the user cancels the search) but also allowing partial sub-tree tasks cancellation to implement the alpha-beta pruning that has to abort the tasks performing the evaluation of some child-node sub-trees without interruping the tasks working on other subtrees.
The work is done succesfully, the parallelised search is faster than the sequential one and dispatches the work on all the available cores running almost always à 100% (according to the task manager). I don't want to bore you with my source code. I just want to precise that I also applied the .Net 4.5 async await pattern to my recursive parallelised method, and implemented the "wait-all one by one" pattern (actually await Task.WhenAll()) to create the sub-tasks exploring the sub-trees issued from each child node. (Note: To allow subtree traversal cancelling, I had to pass a list of cancellation tokens to the recursive async subtree exploration method, each recursing exploration level adding a new CancellationToken to the list).
Now that you understand the context, here are my questions:
Launching a search will rapidly create millions of tasks evaluating millions of subtrees issued from each tree node, but only a few processor cores are actually available to execute these tasks with a few threads (one per core ?). So is that a good idea not to take this into account and to create tasks for each child node/subtree having to be evaluated/traversed ? Is there no overhead induced by creating more tasks than necessary to expose the maximum of parallelism ? Wouldn't it be better to limit the number of launched tasks (for exemple by lauching tasks for the highest level nodes only and using a purely sequential traversal method for the lowest level ones ?) Or can we not care about this by creating million of tasks, even if we know that some of them (these launched on the deepest tree nodes) perform a very small work ?
I didn't prevent the "passing loop index variable to the launched task by closure" issue using a lambda expression argument (like proposed in the course) but by copying the loop index variable in a local copy variable before launching the task and referring this copy by closure, instead of the index loop variable from the lambda expression. What do you think of this workaround ? Is that as safe as the lambda expression argument passing ?
The alpha-beta pruning optimisation of the minimax algorithm is based on the sequential traversal of each subtree. Parallel traversing makes the alpha-beta pruning less efficient since if all the parallel subtree traversals of the child nodes take aproximately the same time to return, none subtree traversal task will be aborted, and if they will, they will be aborted as they all were already almost completed... This will highly limit the gain given by the alpha-beta optimisation. Do you know smart strategies to adress this issue when parallelising the game-tree traversal with alpha-beta optimisation ?