
I am training a PPO agent in a custom environment using the Ray RLlib library. Since my action space contains many illegal actions, I have defined a custom model, as suggested by the Ray documentation, to mask these actions out. The model is defined as follows:

# Imports required by this model (Ray 2.x with gymnasium; adjust to your Ray version):
from gymnasium.spaces import Dict
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()


class ActionMaskModel(TFModelV2):
    """Model that handles simple discrete action masking.
    This assumes the outputs are logits for a single Categorical action dist.
    Getting this to work with a more complex output (e.g., if the action space
    is a tuple of several distributions) is also possible but left as an
    exercise to the reader.
    """

    def __init__(
        self, obs_space, action_space, num_outputs, model_config, name, **kwargs
    ):

        orig_space = getattr(obs_space, "original_space", obs_space)
        assert (
            isinstance(orig_space, Dict)
            and "action_mask" in orig_space.spaces
            and "observations" in orig_space.spaces
        )

        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        self.internal_model = FullyConnectedNetwork(
        # self.internal_model = ComplexInputNetwork(
            orig_space["observations"],
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )

        # disable action masking --> will likely lead to invalid actions
        self.no_masking = model_config["custom_model_config"].get("no_masking", False)

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the unmasked logits.
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})

        # If action masking is disabled, directly return unmasked logits
        if self.no_masking:
            return logits, state

        # Convert action_mask into a [0.0 || -inf]-type mask.
        inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
        masked_logits = logits + inf_mask

        # Return masked logits.
        return masked_logits, state

    def value_function(self):
        return self.internal_model.value_function()
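
For reference, a minimal sketch of how such a model is registered and plugged into a PPO config (the environment name "MyMaskedEnv" is a placeholder, and the config API may differ slightly between Ray versions):

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.models import ModelCatalog

# Register the custom model under a name RLlib can look up.
ModelCatalog.register_custom_model("action_mask_model", ActionMaskModel)

config = (
    PPOConfig()
    .environment(env="MyMaskedEnv")  # placeholder for your registered env
    .framework("tf2")
    .training(
        model={
            "custom_model": "action_mask_model",
            "custom_model_config": {"no_masking": False},
        }
    )
)
algo = config.build()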

The model fulfills its main purpose: invalid actions are masked out and never selected. However, during training I get the following warning:

KL divergence is non-finite, this will likely destabilize your model and the training process. Action(s) in a specific state have near-zero probability. This can happen naturally in deterministic environments where the optimal policy has zero mass for a specific action. To fix this issue, consider setting the coefficient for the KL loss term to zero or increasing policy entropy.

I have tried the fixes proposed in the warning message, but with no luck. After reading further into the issue, I think I have found the cause of the problem, but I don't know how to implement the solution.
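
For reference, in RLlib those suggestions correspond roughly to the following PPO settings (a sketch building on the config above; the values are only illustrative):

# Attempted fixes from the warning message (values are illustrative only):
config = config.training(
    kl_coeff=0.0,        # drop the KL penalty term entirely
    entropy_coeff=0.01,  # push the policy towards higher entropy
)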

A paper by Huang et al. (2020) (https://arxiv.org/pdf/2006.14171.pdf) investigates the effects of action masking. They state that if actions are sampled from the action-masked probabilities, but the policy gradient is then computed from the non-masked probabilities, no illegal actions will ever be chosen, yet the KL divergence can explode, which destabilizes training. This sounds exactly like my problem!
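
To illustrate numerically why the KL term can become non-finite (my own toy example, not taken from the paper): as soon as one of the two distributions assigns exactly zero probability to an action that the other still puts mass on, the KL in that direction is infinite:

import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])   # unmasked logits from the network
mask = np.array([1.0, 1.0, 0.0])     # third action is illegal

p_unmasked = softmax(logits)
p_masked = softmax(logits + np.where(mask > 0, 0.0, -1e9))  # masked prob underflows to 0

with np.errstate(divide="ignore"):
    # KL(masked || unmasked): finite, the zero-probability term contributes 0.
    kl_masked_first = np.sum(p_masked * np.log((p_masked + 1e-12) / p_unmasked))
    # KL(unmasked || masked): infinite, because p_masked is exactly 0
    # for an action that p_unmasked still assigns mass to.
    kl_unmasked_first = np.sum(p_unmasked * np.log(p_unmasked / p_masked))

print(kl_masked_first)    # small finite number
print(kl_unmasked_first)  # inf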

How do I ensure that the policy gradient is updated according to the action-masked probabilities?

  • Intuitively, I would go about it as I did in a previous project: in a gridworld where some cells were invalid to enter, the agent received a large negative reward and stayed in the same cell whenever it tried to enter one. The agent thereby indirectly learns not to take that action a in state s, though you may have to adjust the environment so the world itself does not update in that case (a minimal sketch of this idea follows after these comments). I just wanted to throw that out here, do what you like with it ;) – Lexpj Jun 28 '23 at 14:14
  • Thank you for your response! I did attempt this approach, but the model had a hard time converging towards an optimum. Furthermore, as the paper by Huang et al. notes, with this approach it can be difficult to tune the negative reward associated with an illegal action. My current action-masking implementation gives the best results of any optimization framework I've tried so far - I would just like to improve the implementation to see if I can get even better results :)) – Jakob Sejten Jun 29 '23 at 12:09
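
For completeness, a minimal sketch of the penalty-based idea from the first comment (the helper names and the penalty value are placeholders, not code from my actual environment):

def step(self, action):
    # Penalty approach: an illegal action gets a large negative reward
    # and leaves the state unchanged, instead of being masked out.
    # Returns the gymnasium-style (obs, reward, terminated, truncated, info) tuple.
    if action not in self.legal_actions(self.state):
        return self._get_obs(), -10.0, False, False, {}
    self.state = self._transition(self.state, action)
    reward = self._reward(self.state)
    terminated = self._is_terminal(self.state)
    return self._get_obs(), reward, terminated, False, {}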
