Monte Carlo Tree Search Tic-Tac-Toe -- Poor Agent

Question

I'm trying to implement Monte Carlo tree search to play tic-tac-toe in Python. My current implementation is as follows:

I have a Board class that handles alterations to the tic-tac-toe board. The state of the board is represented by a 2x3x3 numpy array, where each of the 2 3x3 matrices are binary matrices individually representing the presence of X's and the presence of O's.

class Board:
  '''
  class handling state of the board
  '''

  def __init__(self):
    self.state = np.zeros([2,3,3])
    self.player = 0 # current player's turn

  def copy(self):
    '''
    make copy of the board
    '''
    copy = Board()
    copy.player = self.player
    copy.state = np.copy(self.state)
    return copy

  def move(self, move):
    '''
    take move of form [x,y] and play
    the move for the current player
    '''
    if np.any(self.state[:,move[0],move[1]]): return
    self.state[self.player][move[0],move[1]] = 1
    self.player ^= 1

  def get_moves(self):
    '''
    return remaining possible board moves
    (ie where there are no O's or X's)
    '''
    return np.argwhere(self.state[0]+self.state[1]==0).tolist()

  def result(self):
    '''
    check rows, columns, and diagonals
    for sequence of 3 X's or 3 O's
    '''
    board = self.state[self.player^1]
    col_sum = np.any(np.sum(board,axis=0)==3)
    row_sum = np.any(np.sum(board,axis=1)==3)
    d1_sum  = np.any(np.trace(board)==3)
    d2_sum  = np.any(np.trace(np.flip(board,1))==3)
    return col_sum or row_sum or d1_sum or d2_sum

I then have a Node class, which handles properties of nodes as the search tree is being built:

class Node:
  '''
  maintains state of nodes in
  the monte carlo search tree
  '''
  def __init__(self, parent=None, action=None, board=None):
    self.parent = parent
    self.board = board
    self.children = []
    self.wins = 0
    self.visits = 0
    self.untried_actions = board.get_moves()
    self.action = action

  def select(self):
    '''
    select child of node with 
    highest UCB1 value
    '''
    s = sorted(self.children, key=lambda c:c.wins/c.visits+0.2*sqrt(2*log(self.visits)/c.visits))
    return s[-1]

  def expand(self, action, board):
    '''
    expand parent node (self) by adding child
    node with given action and state
    '''
    child = Node(parent=self, action=action, board=board)
    self.untried_actions.remove(action)
    self.children.append(child)
    return child

  def update(self, result):
    self.visits += 1
    self.wins += result

Finally, I have UCT function which pulls everything together. This function takes a Board object and builds the Monte Carlo search tree to determine the next best possible move from the given board state:

def UCT(rootstate, maxiters):

  root = Node(board=rootstate)

  for i in range(maxiters):
    node = root
    board = rootstate.copy()

    # selection - select best child if parent fully expanded and not terminal
    while node.untried_actions == [] and node.children != []:
      node = node.select()
      board.move(node.action)

    # expansion - expand parent to a random untried action
    if node.untried_actions != []:
      a = random.choice(node.untried_actions)
      board.move(a)
      node = node.expand(a, board.copy())

    # simulation - rollout to terminal state from current 
    # state using random actions
    while board.get_moves() != [] and not board.result():
      board.move(random.choice(board.get_moves()))

    # backpropagation - propagate result of rollout game up the tree
    # reverse the result if player at the node lost the rollout game
    while node != None:
      result = board.result()
      if result:
        if node.board.player==board.player:
          result = 1
        else: result = -1
      else: result = 0
      node.update(result)
      node = node.parent

  s = sorted(root.children, key=lambda c:c.wins/c.visits)
  return s[-1].action

I've scoured this code for hours and simply can't find the error in my implementation. I've tested numerous board states and pitted two agents against each other, but the function returns poor actions for even the most simple of board states. What am I missing and/or what is wrong with my implementation?

edit: here is an example of how two agents might be implemented to play:

b = Board() # instantiate board
# while there are moves left to play and neither player has won
while b.get_moves() != [] and not b.result():
    a = UCT(b,1000)  # get next best move
    b.move(a)        # make move
    print(b.state)   # show state

Please add all the relevant code so that we can run this ourselves. If UCT is the entry point then what arguments should it be given? — Alex Hall, Mar 23 '18 at 19:30
@AlexHall I've added a simple implementation of dueling agents starting with an empty board — sam, Mar 23 '18 at 19:42
OK, I don't know this algorithm and I'm not willing to tackle this question, but for anyone else who tries here's the full code to get you started quickly: https://pastebin.com/RB9K7cDD I've also added a method to display the board nicely. — Alex Hall, Mar 23 '18 at 20:23

score 3 · Accepted Answer · answered Mar 24 '18 at 10:28

The problem appears to be as follows:

Your get_moves() function does not check if the game is already over. It can generate a non-empty list of moves for states where someone has already won.
When creating a new Node, you also don't check if the game state is already over, so a non-empty list of untried_actions is created.
In the Selection and Expansion phases of the algorithm, you also don't check for terminal game states. Once the Expansion phase hits a node that contains a game state where someone already won, it will happily apply an extra action and create a new node for the tree again, which subsequent Selection phases will also happily keep going through.
For these nodes where the game continues being played after someone already won, result() can return an incorrect winner. It simply checks if the most recent player to make a move won, which is correct if you stop playing as soon as someone wins, but can be incorrect if you keep playing after someone already won. So, you propagate all kinds of incorrect results through the tree.

The simplest way to fix this will be to modify get_moves() such that it returns an empty list when the game is already over. Then, these nodes will always fail the if node.untried_actions != [] check, which means the expansion phase gets skipped altogether, and you move straight on to the Play-out phase where there is a proper check for terminal game states. This can be done as follows:

def get_moves(self):
    """
    return remaining possible board moves
    (ie where there are no O's or X's)
    """
    if self.result():
        return []

    return np.argwhere(self.state[0] + self.state[1] == 0).tolist()

Thanks! That was exactly the problem. Shortly after posting I noticed certain terminal states were slipping through due to the way I have my result() function implemented. I simply added a test to see if the state was terminal before performing the expansion step, which seemed to fix things. Thanks for the detailed response, cheers — sam, Mar 24 '18 at 18:11

Monte Carlo Tree Search Tic-Tac-Toe -- Poor Agent

1 Answers1