
Edit: I'm trying to create an agent to play the game of Tetris, using a convolutional neural network that takes the board state + current piece as input. From what I've read, Deep Q-learning is not very good at this, which I just confirmed.

Suppose that an agent is learning a policy to play a game, where each game step can be represented as

s, a, r, s', done

representing

state, action, reward, next state, game over

In the Deep Q-learning algorithm, the agent is in state s and takes some action a (following an epsilon-greedy policy), observes a reward r and gets to the next state s'.

The agent acts like this:

# returns an action index (epsilon-greedy)
def get_action(state, epsilon):
    if random() < epsilon:
        return random_action_index              # explore: a uniformly random action
    else:
        return argmax(nnet.predict(state))      # exploit: the action with the highest Q-value

The parameters are updated by greedily observing the max Q-value in state s', so we have

# action taken was 'a' in state 's', leading to 's_'
prediction = nnet.predict(s)
if done:
    target = reward
else:
    target = reward + gamma * max(nnet.predict(s_))

prediction[a] = target   # only the Q-value of the action actually taken is changed

The [prediction, target] pair is fed to the nnet for a weight update. So this nnet takes a state s as input and outputs a vector of Q-values with dimension n_actions. This is all clear to me.
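
Just to be explicit about what I mean, a minimal sketch of that single update step could look like this (assuming a Keras-style nnet with batch-based predict/fit, states stored as numpy arrays, and gamma as the discount factor; all names are just placeholders):

import numpy as np

def q_learning_update(nnet, s, a, reward, s_, done, gamma=0.99):
    # Q-values for the current state; shape (n_actions,)
    prediction = nnet.predict(s[np.newaxis])[0]
    if done:
        target = reward
    else:
        # bootstrap from the best Q-value in the next state
        target = reward + gamma * np.max(nnet.predict(s_[np.newaxis])[0])
    # overwrite only the entry of the action actually taken
    prediction[a] = target
    # one gradient step towards the modified Q-vector
    nnet.fit(s[np.newaxis], prediction[np.newaxis], verbose=0)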

Now, suppose that my state-action values are so noisy that this approach simply will not work. So, instead of outputting a vector of dimension n_actions, my nnet outputs a single value, representing the "state quality" (how desirable that state is).

Now my agent acts like this:

# returns an action based on how good the next state is
def get_action(state, epsilon):
    actions = []
    for action in possible_actions(state):      # possible_actions: however the legal moves are enumerated
        next_game = game.deepcopy()             # simulate on a copy, so the real game state is untouched
        next_game.apply(action)
        action.value = nnet.predict(next_game.get_state())
        actions.append(action)

    if random() < epsilon:
        return randomChoice(actions)                   # explore
    else:
        return max(actions, key=lambda a: a.value)     # exploit: action leading to the best state

And my [prediction, target] is like this:

# action taken was 'a' in state 's', leading to 's_'
prediction = nnet.predict(s)   # now a single scalar: the value of state s
if done:
    target = reward
else:
    target = reward + gamma * nnet.predict(s_)
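
The corresponding update for this single-output net, under the same assumptions as the sketch above (Keras-style predict/fit, numpy states, np imported as before), might then be:

def state_value_update(nnet, s, reward, s_, done, gamma=0.99):
    if done:
        target = reward
    else:
        # bootstrap from the predicted value of the next state
        target = reward + gamma * nnet.predict(s_[np.newaxis])[0, 0]
    # the net outputs a single value, so there is no indexing by action
    nnet.fit(s[np.newaxis], np.array([[target]]), verbose=0)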

I have some questions regarding this second algorithm:

a) Does it make sense to act non-greedily sometimes?

Intuitively no, because if I land in a bad state, it was probably because of a bad random action, not because the previous state was 'bad'. The Q-learning update will adjust the value of that bad action, but this second algorithm will wrongly adjust the value of the previous state.

b) What kind of algorithm is this? Where does it fit in Reinforcement Learning?

c) In the case of Tetris, the state almost never repeats, so what can I do in this case? Is that the reason Deep Q-learning fails here?

This may look confusing, but the algorithm actually works. I can provide extra details if necessary, thank you!

Fernando
  • I decided to write an answer to the question here now. For future reference, I suspect these kinds of questions will be more on-topic on https://ai.stackexchange.com/ rather than StackOverflow. – Dennis Soemers Oct 05 '18 at 12:26

1 Answer


Now, suppose that my state-action values are so noisy that this approach simply will not work. So, instead of outputting a vector of dimension n_actions, my nnet outputs a single value, representing the "state quality" (how desirable that state is).

Now my agent acts like this:

# returns an action based on how good the next state is
get_action(state, epsilon):
    actions = []
    for each action possible in state:
         game.apply(action)
         val = nnet.predict(game.get_state())
         action.value = val
         actions.append(action)

    if random() < epsilon
        return randomChoice(actions)
    else
        return action with_max_value from actions

First a quick note on that pseudocode: I don't think that would work, because you don't simulate the effects of the different actions on copies of the game state, but just on the game state directly. You'd probably want to create separate copies of the game state first, and run each action once on a different copy.
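
A rough sketch of what I mean, in the same pseudocode style as your post (the names game, nnet, randomChoice and possible_actions follow the pseudocode in the question):

# evaluate every legal action on its own copy of the game
def get_action(state, epsilon):
    values = {}
    for action in possible_actions(state):
        sim = game.deepcopy()        # a separate copy per action
        sim.apply(action)            # the one-step lookahead happens on the copy only
        values[action] = nnet.predict(sim.get_state())

    if random() < epsilon:
        return randomChoice(list(values))      # explore
    return max(values, key=values.get)         # exploit: action leading to the best successor state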

Anyway, this kind of algorithm is generally assumed not to be possible in Reinforcement Learning settings. In RL, we generally operate under the assumption that we don't have a "simulator" or "forward model" or anything like that. We normally assume that we have an agent situated in a real environment in which we can generate experience that can be used to learn from. Under this assumption, we cannot implement that kind of for each action possible in state loop which simulates what would happen if we were to execute different actions in the same game state. The assumption is that we first have to pick one action, execute it, and then learn from that particular trajectory of experience; we can no longer "go back", imagine that we had selected a different action, and also learn from that trajectory.

Of course, in practice doing this is often possible, because we often do actually have a simulator (for example a robot simulator, or a game, etc.). The majority of research in RL still makes the assumption that we do not have such a simulator, because this leads to algorithms that may eventually be useable in "real-world" situations (e.g., real-world physical robots). Implementing the idea you describe above actually means you're moving more towards search algorithms (such as Monte-Carlo Tree Search), rather than Reinforcement Learning algorithms. It means that your approach is limited to scenarios in which you do have a simulator available (e.g., games).


a) Does it make sense to act non-greedily sometimes?

Even though you include the search-like process of looping through all actions and simulating their effects, I suspect you'll still need some form of exploration if you want to converge to good policies, so yes, you'll have to act non-greedily sometimes. This will likely cause your algorithm to converge to something different from the traditional notion of an "optimal policy", but that's not much of a problem if your epsilon is fairly low; in practice, the learned policy will just tend to be slightly "safer". See also my answer to this other question.

b) What kind of algorithm is this? Where does it fit in Reinforcement Learning?

Aside from my discussion above on how this actually moves a bit towards the domain of Search algorithms rather than RL algorithms, it also looks to me like this would be an on-policy algorithm rather than an off-policy algorithm (standard Q-learning is off-policy, because it learns about the greedy policy whilst generating experience through a non-greedy policy). This distinction is also discussed in detail by most of the answers to the question I linked to above.
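
To make the on-policy/off-policy distinction concrete, here is a rough sketch of the different bootstrap targets in the pseudocode style of the question (a_next denotes the action the epsilon-greedy behaviour policy actually selects in s_; the first two lines use the n_actions-output net, the last one the single-output net):

# Off-policy (standard Q-learning): bootstrap from the greedy action in s_,
# regardless of what the behaviour policy will actually do next.
target_q_learning = reward + gamma * max(nnet.predict(s_))

# On-policy (Sarsa): bootstrap from the action a_next that the epsilon-greedy
# behaviour policy actually takes in s_.
target_sarsa = reward + gamma * nnet.predict(s_)[a_next]

# Your state-value variant: bootstrap from the value of s_ directly. The states
# you visit are generated by the epsilon-greedy behaviour policy, which is what
# gives the learned values their on-policy flavour.
target_state_value = reward + gamma * nnet.predict(s_)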

Dennis Soemers
  • Maybe my pseudo-code is not very clear, but I'm not simulating, I'm just getting either a random move or a best move and then using it, just like q-learning. – Fernando Oct 05 '18 at 18:46
  • Btw, I've edited the post, and I'll check your links, tks! – Fernando Oct 05 '18 at 18:55
  • @Fernando That's not what the code looks like to me. You have a for-loop through actions, and for each `action`, you simulate executing that action (using `game.apply(action)`) and evaluate the successor state (using `val = nnet.predict(game.get_state())`). Doesn't `game.apply(action)` essentially perform a one-step lookahead, doesn't that give you the successor state that you would end up in if you were to execute `action` in the current state? – Dennis Soemers Oct 05 '18 at 19:19
  • In the real code I actually undo the move, I've edited the post to reflect that. I loop-apply through the moves just to get their value, just like q-learning would do in a single model.predict(state). Sorry for the confusion! – Fernando Oct 05 '18 at 19:24
  • @Fernando Ok, right. But `game.deepcopy().apply(action)` essentially does perform simulation, does it not? Otherwise it would be impossible to ever pick "best" actions based on Value estimates for only states. You need Value estimates for state-action pairs if you want to be able to pick "best" actions without simulation – Dennis Soemers Oct 05 '18 at 19:25
  • Note that in this case with "simulation" I just mean the same as "lookahead", which can just be a single step ahead. I get that you're not doing a full-episode simulation. – Dennis Soemers Oct 05 '18 at 19:27
  • Ah, I see your point now, ok. But deep q-learning is not working in this case. So my code is not a 'value function' learning? – Fernando Oct 05 '18 at 19:28
  • Well, intuitively it looks more to me like some sort of... Deep Sarsa rather than Deep Q, it has more of an on-policy flavour. The one-step lookahead that you do perform turns your state-value function in... some form of state-action-value learning again... at least, that's my intuition. Would have to examine everything in great detail from first principles to tell for sure. That said, Tic Tac Toe is a 2-player game, which is not yet taken into account at all. How well that's going to work also depends very much on the rest of your setup (e.g., who do you play against?) – Dennis Soemers Oct 06 '18 at 08:31