
Using Stable Baselines 3:

Given that deterministic=True always returns the action with the highest probability, what does that mean for environments where the action space is "box", "multi-binary" or "multi-discrete" where the agent is supposed to select multiple actions at the same time? How does deterministic=True work in these environments / does it work at all in the way it is supposed to?

This question is partly based on the question "What does "deterministic=True" in stable baselines3 library means?" and is potentially related to another question of mine, "Reinforcement learning deterministic policies worse than non deterministic policies".

GHE

1 Answer


All that deterministic=True does is return the mode of the distribution instead of a sample:

    def get_actions(self, deterministic: bool = False) -> th.Tensor:
        """
        Return actions according to the probability distribution.
        :param deterministic:
        :return:
        """
        if deterministic:
            return self.mode()
        return self.sample()

It does not matter which action space you use. Mathematically, exactly one action is taken per step in RL. An action space that "looks" multi-dimensional just makes the actual action space exponentially large, that's all. Depending on the specific agent, you will usually have either an independent distribution per action group (e.g. a separate head in a neural network), so that each group gets its own "most likely action", or, with a more advanced network, a full joint distribution parametrised by, say, an autoregressive model.
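To make the "independent distribution per action group" case concrete, here is a minimal sketch in plain Python. It is not Stable Baselines 3 code: the probabilities and the helper names `factorised_mode` and `joint_mode` are made up for illustration, and the two heads stand in for something like a MultiDiscrete([3, 2]) action space. It checks that picking the most likely choice per head gives the same action as a brute-force argmax over the full product action space.

```python
from itertools import product

# Hypothetical per-head probabilities for a MultiDiscrete([3, 2])-like space.
head_probs = [
    [0.2, 0.5, 0.3],  # head 0: three choices
    [0.6, 0.4],       # head 1: two choices
]


def factorised_mode(heads):
    """Pick the most likely choice of each head independently."""
    return tuple(max(range(len(p)), key=p.__getitem__) for p in heads)


def joint_mode(heads):
    """Brute-force argmax over the full (product) action space."""

    def joint_prob(action):
        # Independence assumption: the joint is the product of the heads.
        prob = 1.0
        for p, a in zip(heads, action):
            prob *= p[a]
        return prob

    all_actions = product(*(range(len(p)) for p in heads))
    return max(all_actions, key=joint_prob)


per_head = factorised_mode(head_probs)
joint = joint_mode(head_probs)
print(per_head, joint)
```

The brute-force version enumerates 3 * 2 = 6 joint actions here; with 20 dimensions it would be infeasible, which is why the per-head argmax is what one would use in practice.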

In short, yes, it makes as much sense as it does in any other action space. The question is rather how you parametrise the policy: a naive parametrisation is less expressive, but in practice it is used in many agents without any issues.
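As a rough illustration of what "less expressive" can mean, the sketch below uses a made-up joint distribution over two binary actions in which the actions are correlated. The numbers and the helper `marginal` are hypothetical. A product of independent per-action distributions cannot represent this joint exactly, and the per-action modes of its marginals do not coincide with the mode of the joint.

```python
# Hypothetical joint distribution over two binary actions (a1, a2).
joint = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}


def marginal(joint_probs, index):
    """Marginal probability of each value of the action at `index`."""
    out = {}
    for action, prob in joint_probs.items():
        out[action[index]] = out.get(action[index], 0.0) + prob
    return out


# Mode of the full joint distribution.
joint_mode = max(joint, key=joint.get)

# Mode of each marginal, taken one action at a time.
marginal_modes = tuple(
    max(m, key=m.get) for m in (marginal(joint, 0), marginal(joint, 1))
)

print(joint_mode, marginal_modes)
```

This only matters when the best actions are strongly coupled across dimensions; when the policy really is a product of independent heads, the two notions of mode agree.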

lejlot
  • Thank you for the answer and for pointing towards the code! To be sure that I get you right: Let's say I define an action space of type Box with 20 dimensions `Box(low=-1, high=1, shape=(20,))`. Does that mean that behind each of the 20 dimensions / actions (is this what you refer to as an "action group"?) there is a distribution from which the action is either sampled in case of `deterministic=False` or the action is chosen by the mode of the distribution with `deterministic=True`? – GHE Jul 24 '22 at 15:48
  • Yes, essentially there should be one joint distribution P(a1=x1, a2=x2, ..., a20=x20 | s). However, for computational simplicity it is often parametrised as a product of independent factors: P(a1=x1, a2=x2, ..., a20=x20 | s) = P(a1=x1 | s) * ... * P(a20=x20 | s). With deterministic=True we take the mode of this distribution, which decomposes into the mode of each small distribution; otherwise we sample from it, which likewise decomposes into sampling from each small one. – lejlot Jul 24 '22 at 16:03
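For the 20-dimensional Box example from the comments, a minimal sketch in plain Python, assuming the policy outputs one independent Gaussian per dimension (a diagonal Gaussian). The means, standard deviations and the function below are hypothetical stand-ins, not the library's implementation. The mode of a Gaussian is its mean, so the deterministic action is simply the vector of means, while the stochastic action draws each dimension separately.

```python
import random

N_DIMS = 20

# Hypothetical policy output for one state: a mean and std per dimension.
means = [0.05 * i - 0.5 for i in range(N_DIMS)]
stds = [0.1] * N_DIMS


def get_actions(deterministic, rng):
    """Return the mode (the means) or one sample of a diagonal Gaussian."""
    if deterministic:
        # Mode of each 1-D Gaussian is its mean.
        return list(means)
    # Sampling the joint decomposes into sampling each dimension on its own.
    return [rng.gauss(mu, sigma) for mu, sigma in zip(means, stds)]


rng = random.Random(0)
first_mode = get_actions(True, rng)
second_mode = get_actions(True, rng)
first_sample = get_actions(False, rng)
second_sample = get_actions(False, rng)
```

Calling it with deterministic=True twice yields the same 20 values both times, whereas two stochastic calls give different vectors scattered around the means.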