3

Simple question on hello world example of the MCTS for tic-tac-toe,

Let's assume we are given a board and we want to make an optimal decision. As I undestand the choice of consecutive nodes while simulation (until leaf is met) is determined by a exploration/exploitation trade-off function (as described on wikipedia). I really wonder what is the intuition behind first component (exploitation) of the function here, especially for games between two players with oppposite goals. Then the meaning of "the most promising" changes depending on who makes a move. Shouldn't this function change depeding on who makes the next move (especially its first component)?

Bhargav Rao
  • 50,140
  • 28
  • 121
  • 140

1 Answers1

3

Yes, that exploitation part of the equation should be implemented to take into account the evaluations from the perspective of the agent/player who gets to select an action in that node.

For single-agent settings, the implementation is straightforward; simply always maximize.

For zero-sum, turn-based, two-player settings, you'd want to alternate between maximizing or minimizing that exploitation part of the equation (note: always maximize the exploration term!). This can also be implemented by simply multiplying that term by -1 in nodes where the opponent gets to move.

Other settings are possible too, but require slightly more implementation effort (e.g. keeping different average scores for different players in settings which are not zero-sum or have more than two players)

Dennis Soemers
  • 8,090
  • 2
  • 32
  • 55