
I am trying to understand how to use mdptoolbox and had a few questions.

What does 20 mean in the following statement?

P, R = mdptoolbox.example.forest(10, 20, is_sparse=False)

I understand that 10 here denotes the number of possible states. What does 20 mean here? Does it represent the total number of actions per state? I want to restrict the MDP to exactly 2 actions per state. How could I do this?

The shape of P returned above is (2, 10, 10). What does 2 represent here? No matter what values I use for total states and actions, it is always 2.

Amanda

1 Answer


The code you are running is correct, but what you are calling is an example generator from the toolbox.

Please go through the documentation carefully.

In the following code:

P, R = mdptoolbox.example.forest(10, 20, is_sparse=False)

The second argument is not the number of actions for the MDP. The documentation describes this argument (`r1`) as follows:

The reward when the forest is in its oldest state and action ‘Wait’ is performed. Default: 4.

In your case, passing 20 means a reward of 20 is received when the forest is in its oldest state and the action Wait is performed.
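As a sketch of how that reward is encoded, here is the reward matrix the forest example describes, hand-built with plain NumPy so it runs without the toolbox (this mirrors the documented behaviour of `mdptoolbox.example.forest` rather than calling it; `r2` defaults to 2 in the toolbox):

```python
import numpy as np

S, r1, r2 = 10, 20, 2   # states; 'Wait' reward in oldest state; 'Cut' reward in oldest state

# Reward matrix R has shape (S, 2): one column per action (0 = 'Wait', 1 = 'Cut').
R = np.zeros((S, 2))
R[S - 1, 0] = r1        # 'Wait' pays r1 only in the oldest state
R[:, 1] = 1             # 'Cut' pays 1 in every state...
R[0, 1] = 0             # ...except the youngest state,
R[S - 1, 1] = r2        # ...and r2 in the oldest state

print(R.shape)          # (10, 2)
print(R[S - 1, 0])      # 20.0 -- this is where the second argument ends up
```

So the 20 never touches the number of states or actions; it only lands in one cell of `R`.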

In this example, the forest is managed by two actions: ‘Wait’ and ‘Cut’. Please refer to the documentation for more details. Since exactly 2 actions are possible, the first dimension of the transition probability matrix P returned by this function always has size 2. You do not need to restrict the action space manually; this example already has exactly 2 actions per state.
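To see why the first dimension of P is always 2, here is a hand-rolled NumPy sketch of the transition array following the forest example's documented dynamics (a fire with probability `p` resets the forest after ‘Wait’; this is an illustration of the structure, not a call into the toolbox):

```python
import numpy as np

S, p = 10, 0.1           # number of states; probability of fire after 'Wait'

P = np.zeros((2, S, S))  # one (S, S) transition matrix per action

# Action 0, 'Wait': with probability p a fire resets the forest to state 0,
# otherwise it ages by one state (and stays put once it is oldest).
P[0] = (1 - p) * np.diag(np.ones(S - 1), 1)
P[0, :, 0] = p
P[0, S - 1, S - 1] = 1 - p

# Action 1, 'Cut': the forest is cut down, so it always returns to state 0.
P[1, :, 0] = 1

print(P.shape)        # (2, 10, 10) -- first axis is the number of actions
print(P.sum(axis=2))  # every row sums to 1, i.e. valid transition matrices
```

Changing the number of states only changes the last two dimensions; the leading 2 is fixed because this example has exactly two actions.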

To understand the use of this toolbox, you should also go through this link.

Aditya
  • Okay, so the two actions `wait` and `cut` are the 2 available actions per state? – Amanda Jun 08 '19 at 18:36
  • Yes, these 2 actions are available per state. – Aditya Jun 08 '19 at 18:41
  • When I keep the number of states as `2`, the probability matrix shape is (2,2,2) and looks like https://pasteboard.co/Iiv9YTB.png . I cannot understand this output and relate it to the reward matrix returned which looks like https://pasteboard.co/Iivaxfp.png . Could you please explain? – Amanda Jun 08 '19 at 18:54
  • Your reward matrix is of the dimension `(state_size, action_size)`, i.e., the reward received when you take an action in a state: `R[i, j]` is the reward for taking `action_j` in `state_i`. You have `2 states` and `2 actions`, hence the `(2, 2)` reward matrix `R`. For state transitions you get `action_size` matrices, each of dimensions `(state_size, state_size)`, so the dimensions of matrix `P` are `(action_size, state_size, state_size)`: `P[i, j, k]` is the probability of going from `state_j` to `state_k` after taking `action_i`. – Aditya Jun 08 '19 at 19:13
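The indexing convention described in the last comment can be checked on a tiny hand-built example (2 states, 2 actions, with arbitrary illustrative values in plain NumPy rather than the toolbox's own output):

```python
import numpy as np

# P[a, i, j]: probability of moving from state i to state j under action a.
P = np.array([
    [[0.1, 0.9],    # action 0, from state 0
     [0.1, 0.9]],   # action 0, from state 1
    [[1.0, 0.0],    # action 1, from state 0
     [1.0, 0.0]],   # action 1, from state 1
])

# R[i, a]: reward for taking action a in state i.
R = np.array([[0.0, 0.0],
              [4.0, 2.0]])

a, i, j = 0, 1, 0
print(P[a, i, j])  # 0.1 -- chance of landing in state 0 from state 1 under action 0
print(R[i, a])     # 4.0 -- reward for taking action 0 in state 1
```

Reading the `(2, 2, 2)` array returned for 2 states works the same way: pick the action first, then row = current state, column = next state.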