@Edit:
I'm trying to create an agent to play the game of Tetris, using a convolutional nnet that takes the board state + current piece as input. From what I've read, Deep Q-learning is not very good at this, which I just confirmed.
@end Edit
Suppose that an agent is learning a policy to play a game, where each game step can be represented as the tuple

    (s, a, r, s', done)

representing

    (state, action, reward, next state, game over)
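(For concreteness, one such transition could be stored like this; the 20x10 board shape is just an illustrative assumption:)

    from collections import namedtuple
    import numpy as np

    # one experience tuple (s, a, r, s', done)
    Transition = namedtuple("Transition", ["s", "a", "r", "s_", "done"])

    t = Transition(s=np.zeros((20, 10)),    # board state (20x10 is illustrative)
                   a=3,                     # action index taken
                   r=1.0,                   # reward observed
                   s_=np.zeros((20, 10)),   # next board state
                   done=False)              # game-over flag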
In the Deep Q-learning algorithm, the agent is in state s, takes some action a (following an epsilon-greedy policy), observes a reward r, and moves to the next state s'.
The agent acts like this:
    import random
    import numpy as np

    # returns an action index (epsilon-greedy)
    def get_action(state, epsilon):
        if random.random() < epsilon:
            return random.randrange(n_actions)          # explore: random action index
        return int(np.argmax(nnet.predict(state)))      # exploit: highest predicted Q-value
The parameters are updated by bootstrapping from the maximum Q-value in the next state s', so we have:
    # action taken was 'a' in state 's', leading to 's_'
    prediction = nnet.predict(s)        # current Q-value vector for s
    if done:
        target = reward
    else:
        target = reward + gamma * np.max(nnet.predict(s_))
    prediction[a] = target              # only the taken action's entry is changed
This [prediction, target] pair is then fed to the nnet for a weight update. So this nnet gets a state s as input, and outputs a vector of Q-values with dimension n_actions. This is all clear to me.
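To make the weight-update step concrete, a single-sample update could look like this (a minimal sketch assuming a Keras-style fit API; the exact call depends on the framework):

    import numpy as np

    # one gradient step: input is the state, target is the Q-vector
    # with the taken action's entry replaced by the computed target
    nnet.fit(np.expand_dims(s, axis=0),
             np.expand_dims(prediction, axis=0),
             epochs=1, verbose=0)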
Now, suppose that my state-action values are so noisy that this approach simply will not work. So, instead of outputting a vector of dimension n_actions, my nnet outputs a single value, representing the "state-quality" (how desirable that state is).
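For reference, the value network I have in mind looks roughly like this (a Keras-style sketch; the layer sizes and the 20x10x2 input shape are just illustrative assumptions):

    from tensorflow import keras
    from tensorflow.keras import layers

    # convolutional net: board + current-piece plane in, one scalar "state-quality" out
    nnet = keras.Sequential([
        layers.Input(shape=(20, 10, 2)),        # board plane + piece plane (illustrative)
        layers.Conv2D(32, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),                        # single value output
    ])
    nnet.compile(optimizer="adam", loss="mse")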
Now my agent acts like this:
    import copy
    import random

    # returns an action based on how good the resulting next state is
    def get_action(state, epsilon):
        actions = []
        for action in game.get_possible_actions(state):  # all legal actions in 'state'
            simulated = copy.deepcopy(game)   # simulate on a copy, not the real game
            simulated.apply(action)
            action.value = nnet.predict(simulated.get_state())
            actions.append(action)
        if random.random() < epsilon:
            return random.choice(actions)               # explore
        return max(actions, key=lambda a: a.value)      # exploit: best predicted next state
And my [prediction, target] is computed like this:
    # action taken was 'a' in state 's', leading to 's_'
    prediction = nnet.predict(s)    # current scalar value estimate of s
    if done:
        target = reward
    else:
        target = reward + gamma * nnet.predict(s_)
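The update is then a plain regression step, pulling the value of s toward the bootstrapped target (again assuming a Keras-style API):

    import numpy as np

    # regress the value of s toward the one-step bootstrapped target
    nnet.fit(np.expand_dims(s, axis=0),
             np.array([target]),
             epochs=1, verbose=0)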
I have some questions regarding this second algorithm:
a) Does it make sense to act non-greedily sometimes?
Intuitively no, because if I land in a bad state, it was probably because of a bad random action, not because the previous state was 'bad'. The Q-learning update adjusts the value of the bad action, but this second algorithm would wrongly adjust the value of the previous state.
b) What kind of algorithm is this? Where does it fit in Reinforcement Learning?
c) In the case of Tetris, the state almost never repeats, so what can I do in this case? Is that the reason Deep Q-learning fails here?
This may look confusing, but the algorithm actually works. I can provide extra details if necessary. Thank you!