
I'm trying to apply reinforcement learning to a problem where the agent produces continuous numerical outputs using a recurrent network. Basically, it is a control problem where two outputs control how the agent behaves.

I define a policy as epsilon-greedy: (1 - eps) of the time the agent uses the output control values directly, and eps of the time it uses the output values perturbed by small Gaussian noise. In this sense the agent can explore. In most of the reinforcement learning literature I see that policy learning requires discrete actions, which can be learned with the REINFORCE algorithm (Williams, 1992), but I'm unsure what method to use here.
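For concreteness, here is a minimal sketch of that exploration step (eps and noise_std are just placeholder values, not tuned numbers):

import numpy as np

def explore(control_outputs, eps=0.1, noise_std=0.05):
    # control_outputs: the two continuous control values produced by the recurrent net
    # with probability (1 - eps) act on the outputs directly,
    # with probability eps add a small Gaussian perturbation for exploration
    control_outputs = np.asarray(control_outputs, dtype=float)
    if np.random.uniform() < eps:
        return control_outputs + np.random.normal(0.0, noise_std, size=control_outputs.shape)
    return control_outputs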

At the moment what I do is use masking so that only the top choices are learned, with an algorithm based on Metropolis-Hastings deciding whether a transition goes toward the optimal policy. Pseudo code:

import numpy as np

def build_target_mask(rewards, timeIndices, std):
    # rewards in (0, 1), optimal reward is 1
    # relate reward to likelihood via L(r) = exp(-|r - 1| / std)
    # since r <= 1, |r - 1| = 1 - r, so -log L(r) = (1 - r) / std
    targetMask = np.zeros(len(timeIndices))
    neglogLi = (1 - np.mean(rewards)) / std
    # visit the rewards in random order to approximate a Markov process
    for k in np.random.permutation(len(rewards)):
        r, idx = rewards[k], timeIndices[k]
        neglogLj = (1 - r) / std
        # Metropolis-Hastings acceptance in log space
        if neglogLj < neglogLi or np.log(np.random.uniform()) < neglogLi - neglogLj:
            # accept the transition, i.e. learn this action
            targetMask[idx] = 1
            neglogLi = neglogLj
    return targetMask

This provides a targetMask with ones for the actions that will be learned using standard backprop.
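To be concrete, this is roughly how I intend the mask to gate the per-timestep loss before backprop (a simplified numpy sketch, not my actual training code):

import numpy as np

def masked_loss(per_step_loss, targetMask):
    # per_step_loss: loss at each time step of the sequence, shape (T,)
    # targetMask: 1 where the Metropolis-Hastings step accepted the action, else 0
    # only the accepted time steps contribute to the gradient during backprop
    per_step_loss = np.asarray(per_step_loss, dtype=float)
    targetMask = np.asarray(targetMask, dtype=float)
    return np.sum(per_step_loss * targetMask) / max(np.sum(targetMask), 1.0)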

Can someone point me to the proper, or a better, way to do this?

Josh Albert
  • Does this answer your question? [How can I apply reinforcement learning to continuous action spaces?](https://stackoverflow.com/questions/7098625/how-can-i-apply-reinforcement-learning-to-continuous-action-spaces) – DBear Dec 11 '21 at 03:04

1 Answer


Policy gradient methods are a good fit for learning continuous control outputs. If you look at http://rll.berkeley.edu/deeprlcourse/#lectures, the Feb 13 lecture as well as the March 8 through March 15 lectures might be useful to you. Actor-critic methods are covered there as well.
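For intuition, here is a minimal sketch of the core of a REINFORCE-style update with a Gaussian policy over a continuous action (names are placeholders, not code from the course); in practice the returned quantity is backpropagated through the network that produced the mean:

import numpy as np

def gaussian_policy_gradient(mean, std, action, ret, baseline=0.0):
    # Gaussian policy over a continuous action: a ~ N(mean, std^2),
    # where `mean` is produced by the policy network
    # gradient of log pi(a | s) with respect to the mean is (a - mean) / std^2
    grad_log_pi = (np.asarray(action) - np.asarray(mean)) / (std ** 2)
    # REINFORCE: weight the score by the (baseline-subtracted) return
    return (ret - baseline) * grad_log_pi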

Ryan Stout