I've been studying up on reinforcement learning, but the thing I don't understand is how a Q value is ever calculated. If you use the Bellman equation Q(s,a) = r + γ*max(Q(s',a')), wouldn't it just go on forever? Because Q(s',a') would need the Q value of one timestep further, and that would just continue on and on. How does it end?
1 Answer
In reinforcement learning you normally try to find a policy (the best action to take in each state), and the learning process ends when the policy no longer changes or the value function (representing the expected cumulative reward) has converged.
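You seem to be confusing Q-learning with value iteration using the Bellman equation. Neither one recurses forever: the values start from arbitrary initial estimates (for example all zeros), and the Q(s',a') on the right-hand side is simply the current estimate, which is refined with every update. Q-learning is a model-free technique in which the obtained reward is used to update Q:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$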
Here the direct reward r_{t+1} is the reward obtained after taking action a_t in state s_t. α is the learning rate, which should be between 0 and 1: if it is 0, no learning is done; if it is 1, only the newest information (the latest reward plus the discounted estimate for the next state) is taken into account.
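To make the update concrete, here is a minimal sketch of tabular Q-learning. The five-state chain environment, the `step` helper, and the hyperparameter values are assumptions made up for illustration; they are not part of the question. The point to notice is that `max(q[next_state])` is a plain table lookup of the current estimate, so nothing recurses.

```python
import random

N_STATES = 5        # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic environment (an assumption for illustration)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Arbitrary initial estimates: every Q value starts at zero.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection, ties broken at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # The target uses the CURRENT estimate for the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


if __name__ == "__main__":
    for s, row in enumerate(q_learning()):
        print(s, [round(v, 3) for v in row])
```

In this toy setting the estimates for moving right settle near 1, 0.9, 0.81, and 0.729 as you move away from the goal, which is the discounted reward propagating backwards one update at a time.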
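Value iteration with the Bellman equation:

$$V_{t+1}(s) = \max_{a} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_t(s') \right)$$

Here a model P_a(s,s') is required, also written P(s'|s,a), which is the probability of going from state s to s' using action a; R_a(s,s') is the reward received on that transition. To check whether the value function has converged, V_{t+1} is normally compared to V_t for all states, and if the largest difference is smaller than a small value ε, the value function (and the policy derived from it) is considered converged:

$$\max_{s} \left| V_{t+1}(s) - V_t(s) \right| < \epsilon$$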
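As a rough sketch of value iteration with this stopping rule, the code below sweeps over all states until the largest change falls below ε. The `build_chain` model (a five-state chain with reward 1 for reaching the last state) and the `(probability, next_state, reward)` tuple layout are hypothetical choices for illustration only.

```python
def build_chain(n_states=5):
    """Hypothetical model: transitions[s][a] = [(prob, next_state, reward), ...]."""
    transitions = {}
    for s in range(n_states - 1):
        left, right = max(0, s - 1), s + 1
        transitions[s] = {
            "left": [(1.0, left, 0.0)],
            "right": [(1.0, right, 1.0 if right == n_states - 1 else 0.0)],
        }
    transitions[n_states - 1] = {}  # terminal state: no actions available
    return transitions


def value_iteration(transitions, n_states, gamma=0.9, eps=1e-8):
    v = [0.0] * n_states  # arbitrary initial estimate V_0
    while True:
        new_v = []
        for s in range(n_states):
            # Expected value of each action under the model, using V_t.
            action_values = [
                sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for outcomes in transitions[s].values()
            ]
            new_v.append(max(action_values) if action_values else 0.0)
        # Stop once the largest change over all states is below eps.
        if max(abs(a - b) for a, b in zip(new_v, v)) < eps:
            return new_v
        v = new_v


if __name__ == "__main__":
    print(value_iteration(build_chain(), 5))
```

Each sweep only reads the previous sweep's values, so the loop is guaranteed to do a finite amount of work per iteration and ends as soon as the values stop changing.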