
I am currently trying to implement my own version of a Connect Four Environment based on the version available on the PettingZoo Library github (https://github.com/Farama-Foundation/PettingZoo/blob/master/pettingzoo/classic/connect_four/connect_four.py).

From their documentation, on the page for the classic environments (https://pettingzoo.farama.org/environments/classic/), the following is stated:

" Most [classic] environments only give rewards at the end of the games once an agent wins or losses, with a reward of 1 for winning and -1 for losing. "

It is not clear to me how the learning is supposed to work for non-terminating states, if the reward signal (on which, I guess, the whole learning of the agent is based) only occurs in terminating states.
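For context, the way I understand it, the value of non-terminal positions would have to come from bootstrapping: each update uses the estimated value of the next state as part of its target, so the terminal +1/-1 is gradually propagated backwards to earlier positions. A minimal tabular sketch of a one-step expected SARSA update (purely illustrative, not my actual code; the constants and the state encoding are made up):

```python
import numpy as np
from collections import defaultdict

ALPHA, GAMMA, EPSILON, N_ACTIONS = 0.1, 1.0, 0.1, 7   # 7 columns in Connect Four

# Q[state] is a vector of action values; the state key could e.g. be a bytes() dump of the board.
Q = defaultdict(lambda: np.zeros(N_ACTIONS))

def expected_sarsa_update(state, action, reward, next_state, terminated):
    """One-step expected SARSA: the target bootstraps on the value of next_state,
    so a terminal reward of +1/-1 is gradually propagated back to earlier positions."""
    if terminated:
        target = reward                                # no bootstrapping at terminal states
    else:
        q_next = Q[next_state]
        # Expected value of the next action under the epsilon-greedy policy.
        probs = np.full(N_ACTIONS, EPSILON / N_ACTIONS)
        probs[np.argmax(q_next)] += 1.0 - EPSILON
        target = reward + GAMMA * np.dot(probs, q_next)
    Q[state][action] += ALPHA * (target - Q[state][action])
```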

I thought of modifying the setup by allowing the environment to emit a reward at every turn, something like:

  • +1 for each (non-terminating) step of the game

  • +100 for a winning state

  • 0 for a draw

  • -100 for illegal moves (which also end the current game/episode)

However, given my current setup, this scheme would require a very high exploration rate for an $\epsilon$-greedy agent. The reason is that, for every newly observed state, the agent takes a random move and, if the resulting state is not terminal, it assigns a positive state-action value to the action it just took (from the +1 step reward), while all the other actions stay at zero. From then on, the agent will pick that same action with very high probability, which prevents any actual learning (a small sketch of this is shown below)...
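To make this concrete, here is a toy sketch of the reward scheme above together with the lock-in effect I described (again illustrative only, not my actual implementation):

```python
import numpy as np

STEP_REWARD, WIN_REWARD, DRAW_REWARD, ILLEGAL_REWARD = 1.0, 100.0, 0.0, -100.0

def shaped_reward(is_illegal, is_win, is_draw):
    """Per-step reward scheme from the list above (an illegal move also ends the episode)."""
    if is_illegal:
        return ILLEGAL_REWARD
    if is_win:
        return WIN_REWARD
    if is_draw:
        return DRAW_REWARD
    return STEP_REWARD

# The lock-in effect: with a zero-initialised value table, a single update with
# reward +1 (the bootstrap term is still zero) makes the just-taken action greedy,
# so an epsilon-greedy policy repeats it with probability 1 - epsilon + epsilon / n.
q = np.zeros(7)                      # action values of one state (7 columns)
alpha, epsilon = 0.1, 0.1
q[3] += alpha * (shaped_reward(False, False, False) - q[3])   # played column 3, got +1
print(int(np.argmax(q)))             # -> 3, now the greedy action
print(1 - epsilon + epsilon / q.size)  # ~0.91 probability of repeating it
```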

I am not so sure how to solve this problem, as allowing for very high exploration rates doesn't seem like a good choice to me... My code is available at https://github.com/FMGS666/RLProject

Probably I should use the same setup as theirs in the GitHub repo, but I didn't quite understand how it deals with the aforementioned problem.
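If I do reuse their setup, my understanding from their docs is that the observation already contains an action mask, so illegal moves can be skipped entirely instead of being penalised, and the reward returned by the environment is then only non-zero at the end of the game. Roughly like this (a sketch based on the AEC usage example in the PettingZoo docs, with a random choice over legal columns as a placeholder for my actual agent):

```python
import random
from pettingzoo.classic import connect_four_v3

env = connect_four_v3.env()
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None                                 # required by the AEC API at episode end
    else:
        mask = observation["action_mask"]             # 1 for playable columns, 0 otherwise
        legal = [a for a, m in enumerate(mask) if m]
        action = random.choice(legal)                 # placeholder: swap in the trained agent here
    env.step(action)
env.close()
```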

Probably I'm missing something important, but thank you very much for the help anyway!

  • The only practical way to do that is to assign a "score" for every position. I don't know how you would judge that one position is better than another position in Connect Four, but if you can't do that, then you cannot do machine learning. It's impossible. All positions are the same. – Tim Roberts Apr 11 '23 at 06:56
  • Shouldn't the agent learn the "score" for each position during training? What would be the point of its learning if I already assign a score to each position? When the agent has "no experience", all positions are the same, as you said, but this should not be the case after training it... – Lorenzo CONSOLI Apr 11 '23 at 07:05
  • I mean, the expected SARSA algorithm should learn such a mapping from (state, action) tuples to values, but the problem is the reward signal during the game (on which, to my understanding, the whole learning of the agent is based) – Lorenzo CONSOLI Apr 11 '23 at 07:07
  • Right. If you can't assign a "score", like chess, then the only feedback is knowing whether a position leads to a win or a loss. I think there are too many paths on Connect Four to track that. – Tim Roberts Apr 11 '23 at 17:29
  • Lorenzo, you are on the right track, as the idea of the algorithm is to learn about its environment and what makes a good or bad move, since there are sequence restrictions as well. I don't feel comfortable saying yes/no on the reward schema, but it seems reasonable; I just don't know if the system is unstable. Another approach is to use 1, 0, -1 for Win, Draw, Lose at terminal states. Non-terminating moves don't score, but illegal moves would terminate; you might be able to prevent this by making sure the chip always drops to the end. Then the question is whether they win/lose – mazecreator Apr 13 '23 at 00:15
  • Thanks for your reply! Yeah, it seems that I managed to fix the problem of illegal actions by assigning a penalty for them. Now the agent seems to be learning somehow. For the last question, tbh I still need to figure out how to measure the quality of an agent. Since I'm training it against itself, I don't think winning/losing makes much sense. I thought of measuring how well it plays by counting how many moves it takes to beat a random agent. Still, this is somewhat made up; accumulating rewards may not make much sense either, as I don't know if it's a good measure of how well it plays… – Lorenzo CONSOLI Apr 14 '23 at 01:31

0 Answers