In Q Learning, how can you ever actually get a Q value? Wouldn't Q(s,a) just go on forever?

Question

I've been studying up on reinforcement learning, but the thing I don't understand is how a Q value is ever calculated. If you use the Bellman equation Q(s,a) = r + γ*max(Q(s',a')), wouldn't it just go on forever? Because Q(s',a') would need the Q value of one timestep further, and that would just continue on and on. How does it end?

score 1 · Accepted Answer · edited May 23 '17 at 12:13

In Reinforcement Learning you normally try to find a policy (the best action to take in a specific state), and the learning process ends when the policy does not change anymore or the value function (representing the expected reward) has converged.

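You seem to be confusing Q-learning with value iteration using the Bellman equation. Q-learning is a model-free technique in which you use the reward you actually obtain to update Q:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)$$

Here the direct reward r_{t+1} is the reward obtained after taking action a_t in state s_t. The max term uses the current stored estimate of Q for the next state rather than expanding it further, so the update never recurses. α is the learning rate, which should be between 0 and 1: if it is 0, no learning is done; if it is 1, the old estimate is discarded and only the newest information (r_{t+1} plus the discounted estimate for the next state) is kept.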
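
To make the "no infinite recursion" point concrete, here is a minimal sketch of tabular Q-learning. The environment is an assumption made up for illustration (a 1-D chain where reaching the last state gives reward 1 and ends the episode), as are the function name `q_learning` and its default parameters. The table starts at arbitrary values (zeros here), and each step only reads the current table entry for the next state:

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain: states 0..n_states-1,
    action 0 = left, action 1 = right, reward 1 on reaching the last state."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # arbitrary initial estimates
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy action choice; ties are broken at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Bootstrapped target: a table lookup, not a recursive expansion.
            # The terminal state has no future value.
            future = 0.0 if s_next == goal else max(q[s_next])
            q[s][a] += alpha * (r + gamma * future - q[s][a])
            s = s_next
    return q

q_table = q_learning()
```

In this toy setup the entries for "right" should settle near γ raised to the number of remaining steps before the final move (about 1.0 next to the goal, about 0.9 one state earlier, and so on), even though every update only looked one step ahead.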

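Value iteration with the Bellman equation:

$$V_{t+1}(s) = \max_{a} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_t(s') \right)$$

This requires a model P_a(s,s'), also written P(s'|s,a), which is the probability of going from state s to s' using action a; R_a(s,s') is the reward for that transition. To check whether the value function has converged, V_{t+1} is normally compared to V_t for all states, and if the largest difference is smaller than a small value (ε), the value function (and with it the greedy policy) is said to have converged:

$$\max_{s} \left| V_{t+1}(s) - V_t(s) \right| < \epsilon$$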
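
As a rough illustration of value iteration with this stopping rule, here is a small sketch in plain Python. The function name `value_iteration`, the nested-list layout `P[a][s][s2]` and `R[a][s][s2]`, and the two-state example MDP are all assumptions invented for this example, not part of any library:

```python
def value_iteration(P, R, gamma=0.9, eps=1e-8):
    """P[a][s][s2] is the transition probability, R[a][s][s2] the reward.
    Returns the converged values, a greedy policy, and the number of sweeps."""
    n_actions, n_states = len(P), len(P[0])

    def backup(V, s, a):
        # Expected one-step return of action a in state s under estimate V.
        return sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2])
                   for s2 in range(n_states))

    V = [0.0] * n_states  # arbitrary starting point
    sweeps = 0
    while True:
        V_new = [max(backup(V, s, a) for a in range(n_actions))
                 for s in range(n_states)]
        sweeps += 1
        delta = max(abs(V_new[s] - V[s]) for s in range(n_states))
        V = V_new
        if delta < eps:  # largest change over all states is below epsilon
            break
    policy = [max(range(n_actions), key=lambda a: backup(V, s, a))
              for s in range(n_states)]
    return V, policy, sweeps

# Hypothetical MDP: state 1 is absorbing with zero reward.
# Action 0 stays put; action 1 in state 0 reaches state 1 with
# probability 0.8 (reward 1) and otherwise stays in state 0.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.2, 0.8], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 0.0]],
     [[0.0, 1.0], [0.0, 0.0]]]
V, policy, sweeps = value_iteration(P, R)
```

Each sweep computes V_{t+1} from the previous V_t only, so the loop ends after a finite number of sweeps once the values stop changing by more than ε.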
