I have seen other, older posts circling around this topic, but the general answer was along the lines of "don't worry about it, just punish the agent and let the NN learn that in that specific state it cannot take some actions!". Well, I don't like that, for several reasons:
- many publications write the action space not as A but as A(s), so it is perfectly normal to treat the action space as a function of the current state s;
- in reality, if you have a wall on your left it is not just a matter of hurting yourself when you try to pass through it: you simply do not have that option. I cannot understand why my RL agent should still have a chance, however small, of going left after training;
- why should we accept extra learning effort to make the agent learn something that is already known?
just to mention a few. I saw that the Discrete space defined in the Gymnasium library accepts a mask array to declare which actions are available, but as far as I can tell it is only used in the random sampling function:
def sample(self, mask: Optional[np.ndarray] = None) -> int:
    """Generates a single random sample from this space.

    A sample will be chosen uniformly at random with the mask if provided
    ...
    """
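As far as I understand, that mask only changes how space.sample() draws random actions, for example (a minimal sketch; the 4-action space and the "wall on the left" mask values are just made up for illustration):

import numpy as np
import gymnasium as gym

space = gym.spaces.Discrete(4)                # hypothetical actions: 0=up, 1=down, 2=left, 3=right
mask = np.array([1, 1, 0, 1], dtype=np.int8)  # in this state "left" is simply not available
action = space.sample(mask=mask)              # index 2 can never be drawn here

This says nothing about what the learned policy itself is allowed to do during or after training.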
Instead, I think that implementing a "dynamic" action space as a function of the current state should somehow affect agent.collect_policy during training, not just random sampling. I am struggling to find complete, working examples of how to implement such a seemingly simple capability. In the end it is not so simple for me, and I would like to know whether elegant solutions have already been developed, even if (like many other things, regrettably) not well documented, in the TF-Agents / TensorFlow context.
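To make it a bit more concrete, this is roughly what I am imagining in TF-Agents (a minimal sketch: I came across the observation_and_action_constraint_splitter argument of DqnAgent and I assume it is meant for this, but I am not sure it is the intended or only way; the observation keys 'state' and 'valid_actions' are my own naming):

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# The environment would pack a per-step validity mask into the observation.
observation_spec = {
    'state': tensor_spec.TensorSpec(shape=(4,), dtype=tf.float32, name='state'),
    'valid_actions': tensor_spec.TensorSpec(shape=(3,), dtype=tf.int32, name='valid_actions'),
}
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(shape=(), dtype=tf.int32, minimum=0, maximum=2)

def splitter(observation):
    # Split the observation into (network input, action mask).
    return observation['state'], observation['valid_actions']

# The Q-network only ever sees the 'state' part of the observation.
q_net = q_network.QNetwork(observation_spec['state'], action_spec)

agent = dqn_agent.DqnAgent(
    time_step_spec,
    action_spec,
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    observation_and_action_constraint_splitter=splitter,
)
agent.initialize()
# If I understand correctly, agent.policy and agent.collect_policy would then
# only ever select actions whose mask entry is 1.

But I could not find a complete, documented end-to-end example of this, nor clear guidance on whether the same idea applies beyond DQN-style agents, which is exactly what I am asking for.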