
I am just getting started with deep reinforcement learning and I am trying to grasp this concept.

I have this deterministic Bellman equation:

$$V(s_t) = \max_{a_t} \big[ r(s_t, a_t) + \gamma V(s_{t+1}) \big]$$

When I introduce the stochasticity of the MDP, I get equation 2.6a:

$$V(s_t) = \max_{a_t} \Big[ r(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \, V(s_{t+1}) \Big] \tag{2.6a}$$

Is this assumption correct? I saw implementation 2.6a without a policy sign on the state value function, but to me that does not make sense, because I am using the probabilities of the different next states I could end up in, which to my mind is the same as having a policy. And if 2.6a is correct, can I then assume that the rest (2.6b and 2.6c) hold as well? Because then I would like to write the state-action function like this:

$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \, V^\pi(s_{t+1})$$

The reason I am doing it like this is that I would like to work my way from a deterministic point of view to a non-deterministic point of view.
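To make it concrete for myself, this is roughly how I picture the two backups in code (the arrays `R`, `P` and `f` below are just made-up placeholders, not a real environment):

```python
import numpy as np

# Made-up tabular MDP, purely for illustration.
n_states, n_actions = 4, 2
gamma = 0.9

R = np.random.rand(n_states, n_actions)                # r(s, a)
P = np.random.rand(n_states, n_actions, n_states)      # P(s' | s, a)
P /= P.sum(axis=2, keepdims=True)                      # normalise to probabilities
f = np.random.randint(n_states, size=(n_states, n_actions))  # deterministic s' = f(s, a)

V = np.zeros(n_states)

# Deterministic backup: the next state is fixed by (s, a).
V_det = np.array([max(R[s, a] + gamma * V[f[s, a]] for a in range(n_actions))
                  for s in range(n_states)])

# Stochastic backup (my 2.6a): expectation over the possible next states.
V_sto = np.array([max(R[s, a] + gamma * P[s, a] @ V for a in range(n_actions))
                  for s in range(n_states)])
```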

I hope someone out there can help on this one!

Best regards Søren Koch

Søren Koch

2 Answers


Yes, your assumption is completely right. In the reinforcement learning field, a value function is the return obtained by starting from a particular state and following a policy π. So yes, strictly speaking, it should be accompanied by the policy sign π.

The Bellman equation basically represents value functions recursively. However, it should be noted that there are two kinds of Bellman equations:

  • The Bellman optimality equation, which characterizes optimal value functions. In this case, the value function is implicitly associated with the optimal policy. This equation contains the non-linear max operator and is the one you have posted. The (optimal) policy dependency is sometimes represented with an asterisk, as follows:

    $$V^*(s_t) = \max_{a_t} \Big[ r(s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \, V^*(s_{t+1}) \Big]$$

    Some short texts or papers may omit this dependency, assuming it is obvious, but I think any RL textbook should initially include it. See, for example, the Sutton & Barto or Busoniu et al. books.

  • The Bellman equation, which characterizes the value function associated with an arbitrary policy π:

    $$V^\pi(s_t) = \sum_{a} \pi(a \mid s_t) \Big[ r(s_t, a) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a) \, V^\pi(s_{t+1}) \Big]$$

In your case, your equation 2.6 is based on the Bellman equation, therefore it should remove the max operator and include the sum over all actions and possible next states. From Sutton & Barto (sorry for the change of notation with respect to your question, but I think it is still understandable):

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]$$
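As a rough sketch of that backup (with hypothetical tabular arrays `V`, `pi`, `P`, `R`, so the names are just placeholders; `R[s, a]` is used as the expected immediate reward, which slightly simplifies the Sutton & Barto notation above):

```python
import numpy as np

def bellman_expectation_backup(V, pi, P, R, gamma):
    """One sweep of v_pi(s) = sum_a pi(a|s) * (r(s,a) + gamma * sum_s' P(s'|s,a) * v_pi(s'))."""
    n_states, n_actions = R.shape
    V_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            # Expected return of taking action a in s and then following pi.
            q_sa = R[s, a] + gamma * P[s, a] @ V
            # Weight by the probability that pi selects a: there is no max operator here.
            V_new[s] += pi[s, a] * q_sa
    return V_new
```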

Pablo EM
  • I've updated the answer to add an extra explanation. I think my point is clearer now. Thanks for the comment! – Pablo EM Feb 25 '18 at 21:00
  • I still disagree with "In your case, your equation 2.6 is based on the Bellman equation, therefore it should remove the `max` operator and include the sum over all actions and possible next states". He started out (given) with the optimal value function (typically we assume that one too if there's no superscript/subscript). Then he tries to rewrite it for the nondeterministic case instead. This does not mean that he should suddenly move away from the optimal value function. – Dennis Soemers Feb 26 '18 at 09:58
  • Right, he should decide whether he wants to use the value function or the optimal value function. But I think the answer points out the OP's misconception and the possible right directions. Maybe I could update the answer if the user provides some feedback. In any case, thanks again for the constructive comments, Dennis :) – Pablo EM Feb 26 '18 at 10:20

No, the value function V(s_t) does not depend on the policy. You see in the equation that it is defined in terms of an action a_t that maximizes a quantity, so it is not defined in terms of actions as selected by any policy.

In the nondeterministic / stochastic case, you will have that sum over probabilities multiplied by state values, but this is still independent of any policy. The sum only runs over the different possible future states, and every term involves exactly the same (policy-independent) action a_t. The only reason you have these probabilities is that, in the nondeterministic case, a specific action in a specific state can lead to one of multiple different possible next states. This is not due to policies, but due to stochasticity in the environment itself.


There does also exist such a thing as a value function for policies, and when talking about that, a symbol for the policy should be included. But this is typically not what is meant by just "value function", and it also does not match the equation you have shown us. A policy-dependent function would replace the max_{a_t} with a sum over all actions a, where each term inside the sum is weighted by the probability pi(s_t, a) of the policy pi selecting action a in state s_t.
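As a minimal sketch of that structural difference (assuming made-up tabular arrays `P`, `R` and `pi`; this only illustrates the shape of the two backups, it is not your equations 2.6b/2.6c):

```python
import numpy as np

def q_values(V, P, R, gamma):
    # Q[s, a] = r(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
    return R + gamma * np.einsum('sat,t->sa', P, V)

def optimality_backup(V, P, R, gamma):
    # Policy-independent: take the maximizing action in every state.
    return q_values(V, P, R, gamma).max(axis=1)

def policy_backup(V, P, R, gamma, pi):
    # Policy-dependent: weight each action by pi(a | s) instead of taking the max.
    return (pi * q_values(V, P, R, gamma)).sum(axis=1)
```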

Dennis Soemers