
In the context of Double Q or Dueling Q Networks, I am not sure I fully understand the difference, especially when it comes to V. What exactly is V(s)? How can a state have an inherent value?

If we are considering this in the context of, let's say, trading stocks, how would we define these three variables?

This may be interesting as well for future readers: https://datascience.stackexchange.com/questions/9832/what-is-the-q-function-and-what-is-the-v-function-in-reinforcement-learning – Slim Shady Nov 14 '21 at 02:43

1 Answer

  • No matter what network we are talking about, the reward is an inherent part of the environment. It is the signal (in fact, the only signal) that an agent receives throughout its life after taking actions. For example: an agent that plays chess gets only one reward, at the end of the game, either +1 or -1; at all other times the reward is zero.

    Here you can see a problem with this example: the reward is very sparse and is given just once, but the states in a game are obviously very different. If an agent is in a state in which it still has its queen while the opponent has just lost theirs, its chances of winning are very high (simplifying a little, but you get the idea). This is a good state, and an agent should strive to get there. If, on the other hand, an agent has lost all of its pieces, it is in a bad state and will likely lose the game.

  • We would like to quantify how good or bad states actually are, and this is where the value function V(s) comes in. Given any state, it returns a number, big or small. The usual formal definition is the expectation of the discounted future rewards under a particular policy π (for the discussion of a policy see this question): V_π(s) = E_π[ r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... | s_t = s ]. This makes perfect sense: a good state is one in which the future +1 reward is very probable; a bad state is the opposite, one in which the future -1 is very probable.

    Important note: the value function depends on the rewards of many future states, not just the current one. Remember that in our example the reward in almost all states is 0. The value function takes into account all future states along with their probabilities.

    Another note: strictly speaking, the state itself doesn't have a value; we assign one to it according to our goal in the environment, which is to maximize the total reward. There can be multiple policies, and each one induces a different value function. But there is (usually) one optimal policy and a corresponding optimal value function. This is what we'd like to find!

  • Finally, the Q-function Q(s, a), or action-value function, is the assessment of a particular action in a particular state for a given policy. For the optimal policy, the action-value function is tightly related to the value function via the Bellman optimality equations; in particular, V(s) = max_a Q(s, a). This makes sense: the value of an action is determined by the immediate reward and the values of the possible states reached after this action is taken (in the game of chess the state transition is deterministic, but in general it is probabilistic as well, which is why we talk about all possible states here).

    Once again, the action-value function is derived from the future rewards, not just the current reward. Some actions can be much better or much worse than others even though the immediate reward is the same (a small sketch of both functions follows right after this list).
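
Here is a minimal, self-contained sketch of these ideas. Everything in it is an illustrative assumption rather than something taken from the answer: a 5-state "corridor" MDP with a single sparse reward of +1 for reaching the last state (loosely mimicking the chess example), a discount factor of 0.9, and plain value iteration. It shows how every state acquires a value through discounting even though almost all rewards are zero, and how V(s) is recovered as max over a of Q(s, a):

```python
import numpy as np

# Toy illustration (not from the answer above): a 5-state corridor MDP with a
# sparse reward, loosely mimicking the chess example. Only the transition into
# the terminal state (state 4) yields reward +1; every other reward is 0.
n_states, gamma = 5, 0.9
actions = ["left", "right"]

def step(s, a):
    """Deterministic transition: returns (next_state, reward, done)."""
    s_next = min(s + 1, n_states - 1) if a == "right" else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

# Value iteration: compute Q(s, a) from the Bellman optimality equation
# Q(s, a) = r + gamma * max_a' Q(s', a'), then read off V(s) = max_a Q(s, a).
Q = np.zeros((n_states, len(actions)))
for _ in range(100):                        # plenty of sweeps to converge here
    for s in range(n_states - 1):           # the terminal state keeps value 0
        for i, a in enumerate(actions):
            s_next, r, done = step(s, a)
            Q[s, i] = r + (0.0 if done else gamma * Q[s_next].max())

V = Q.max(axis=1)
print("V(s):        ", V.round(3))          # higher the closer we are to the reward
print("Q(s, right): ", Q[:, 1].round(3))
print("Q(s, left):  ", Q[:, 0].round(3))
```

Running it prints roughly V = [0.729, 0.81, 0.9, 1.0, 0.0]: the closer a state is to the sparse reward, the higher its value, which is exactly the "good state / bad state" intuition above (the terminal state itself gets 0 because no further reward can follow it).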


Speaking of the stock trading example, the main difficulty is to define a policy for the agent. Let's imagine the simplest case. In our environment, a state is just a tuple (current price, position). In this case:

  • The reward is non-zero only when an agent actually holds a position; when it is out of the market, the reward is zero. This part is more or less easy.
  • But the value and action-value functions are highly non-trivial (remember, they account only for future rewards, not past ones). Say the price of AAPL is at $100: is that good or bad considering future rewards? Should you rather buy or sell it? The answer depends on the policy...

    For example, an agent might somehow learn that every time the price suddenly drops to $40, it will recover soon (this sounds too silly, it's just an illustration). Now if an agent acts according to this policy, the price around $40 is a good state and its value is high. Likewise, the action-value Q around $40 is high for "buy" and low for "sell". Choose a different policy and you'll get different value and action-value functions. Researchers try to analyze stock history and come up with sensible policies, but no one knows an optimal policy. In fact, no one even knows the state transition probabilities, only their estimates. This is what makes the task truly difficult. A toy sketch of such a setup is given below.
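
To make the (price, position) formulation a bit more concrete, here is a toy sketch. Every specific in it is an assumption made up for illustration: the random-walk price model, the integer price levels, the two actions "hold_cash" / "hold_stock", the hyperparameters, and the number of episodes. It uses plain tabular Q-learning, which estimates the optimal action-value function directly rather than evaluating a hand-crafted policy like the "$40 always recovers" one, and the reward is the price change only while a position is held:

```python
import random
from collections import defaultdict

ACTIONS = ["hold_cash", "hold_stock"]   # the "position" part of the state is 0 or 1
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1  # made-up hyperparameters

def simulate_price(steps=50):
    """Hypothetical random-walk price path, discretized to integer levels."""
    price, path = 100, []
    for _ in range(steps):
        price = max(1, price + random.choice([-1, 0, 1]))
        path.append(price)
    return path

Q = defaultdict(float)                  # Q[(state, action)] -> estimated action value

def choose_action(state):
    """Epsilon-greedy action selection over the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(500):
    path = simulate_price()
    position = 0                        # start out of the market
    for t in range(len(path) - 1):
        state = (path[t], position)     # state = (current price, position)
        action = choose_action(state)
        position = 1 if action == "hold_stock" else 0
        # Reward is non-zero only while we actually hold a position.
        reward = (path[t + 1] - path[t]) * position
        next_state = (path[t + 1], position)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        # Standard Q-learning update toward the bootstrapped target.
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# The value of a state is then the best action value available in it:
flat_at_100 = (100, 0)
print("V(price=100, no position) ~", round(max(Q[(flat_at_100, a)] for a in ACTIONS), 3))
```

V(s) for any visited state is read off as max_a Q(s, a), just as in the first sketch. With real data the state would of course need far more information than a single discretized price, and the true transition probabilities are unknown, which is part of what makes the task so hard.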
