
ϵ-greedy policy

I know that the Q-learning algorithm should balance exploration and exploitation. Since I'm a beginner in this field, I wanted to implement a simple version of such exploration/exploitation behavior.

Optimal epsilon value

My implementation uses the ϵ-greedy policy, but I'm at a loss when it comes to deciding the epsilon value. Should epsilon depend on the number of times the algorithm has visited a given (state, action) pair, or should it depend on the number of iterations performed?

My suggestions (rough sketch below):
  1. Lower the epsilon value each time a given (state, action) pair has been encountered.
  2. Lower the epsilon value after a complete iteration has been performed.
  3. Lower the epsilon value each time we encounter a state s.
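To make these concrete, here is roughly what I have in mind (Python, purely illustrative; the tables Q, N_sa, N_s, the action set, and the decay factor are just placeholders of mine):

```python
import random
from collections import defaultdict

Q = defaultdict(float)       # Q[(state, action)] -> current value estimate
N_sa = defaultdict(int)      # visit counts per (state, action) pair (suggestion 1)
N_s = defaultdict(int)       # visit counts per state (suggestion 3)
ACTIONS = ["left", "right"]  # placeholder action set
DECAY = 0.99                 # placeholder per-iteration decay factor (suggestion 2)

def epsilon(state, action, iteration, option):
    """Epsilon under each of the three suggestions."""
    if option == 1:
        return 1.0 / (1 + N_sa[(state, action)])   # decays with visits to this pair
    if option == 2:
        return DECAY ** iteration                  # decays once per completed iteration
    return 1.0 / (1 + N_s[state])                  # decays with visits to state s

def epsilon_greedy(state, eps):
    """Plain epsilon-greedy selection over the current Q estimates."""
    if random.random() < eps:
        return random.choice(ACTIONS)                    # explore
    return max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit
```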

Much appreciated!

OccamsMan
  • Did you make any progress in that manner? Did you try your different suggestions and compare them to the accepted answer? I've experimented with a positive constant epsilon and a decaying epsilon and got acceptable results, but I'm curious to see whether having epsilon as a function of the number of visits to the current (state, action) pair wouldn't give a better result. It makes sense to me to only decay epsilon for a (state, action) pair the agent has already visited several times, while keeping it higher for a (state, action) pair the agent has never visited yet. – Romain G Jan 06 '17 at 12:23
  • Yes, I've tried `Regret minimization` as well. It speeds up the convergence rate, but at the cost of not always finding the best solution. On really big problem instances, I tend to prefer the regret-minimization approach, since it quickly guides the search toward better solutions. – OccamsMan Mar 08 '17 at 06:17

2 Answers


Although in many simple cases ε_k is kept as a fixed number in the range (0, 1), you should know that: usually, the exploration diminishes over time, so that the policy used asymptotically becomes greedy and therefore (as Q_k → Q*) optimal. This can be achieved by making ε_k approach 0 as k grows. For instance, an ε-greedy exploration schedule of the form ε_k = 1/k diminishes to 0 as k → ∞, while still satisfying the second convergence condition of Q-learning, i.e., while allowing infinitely many visits to all the state-action pairs (Singh et al., 2000).
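For concreteness, a minimal sketch of that schedule plugged into ε-greedy action selection (Python; the Q table, action set, and function name are my own placeholders, not from the quoted source):

```python
import random

def epsilon_greedy(Q, state, actions, k):
    """Pick an action with eps_k = 1/k: exploratory early on, greedy as k -> infinity."""
    eps = 1.0 / k                                      # eps_1 = 1.0, eps_10 = 0.1, ...
    if random.random() < eps:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit the current estimates
```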

What I usually do is this: set alpha = 1/k (with the initial k = 1 or 2). As you go trial by trial, k increases and alpha decreases, which also keeps the convergence guaranteed.
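Putting that together, here is roughly how I read the scheme, with k incremented once per trial and both ε and the step size α set to 1/k (Python sketch; the gym-style env interface, GAMMA, and the episode count are my assumptions, not part of the answer, and as noted in the comments the two decay constants may differ in practice):

```python
import random
from collections import defaultdict

GAMMA = 0.9       # discount factor (assumed)
ACTIONS = [0, 1]  # placeholder action set

def train(env, episodes=1000):
    Q = defaultdict(float)        # Q[(state, action)] -> estimate
    k = 1                         # start k at 1 (or 2), as in the answer
    for _ in range(episodes):
        eps = alpha = 1.0 / k     # both decay as k grows; constants could differ
        state = env.reset()       # assumed gym-style environment
        done = False
        while not done:
            if random.random() < eps:
                action = random.choice(ACTIONS)                     # explore
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit
            next_state, reward, done = env.step(action)             # assumed interface
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            target = reward + (0.0 if done else GAMMA * best_next)
            # tabular Q-learning update with the decaying step size alpha
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
        k += 1                    # advance k once per trial/episode
    return Q
```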

NKN
    Also known as epsilon-decay. – danelliottster Apr 23 '14 at 13:38
  • @NKN what is k in epsilon * k? – maddie Nov 07 '18 at 21:32
  • Actually I think I understand that k is time steps here. But do you decay epsilon and alpha for optimal q learning? – maddie Nov 07 '18 at 21:51
@Matt k is the time step, i.e. ε_k is the kth ε. Decaying alpha is a good idea to decrease the step size in the learning (updating the values) and to avoid jumps when converging to the optimal value. The decay constants can differ for the two terms, though. – NKN Nov 09 '18 at 19:59

It is usually wise to simply set ε to a positive constant, unless you have a good reason not to.

Don Reba
    Empirically: Shouldn't the agent be less likely to accept exploration as the Q-value table converges towards the true values? Example: a game agent should prefer its emergent perfect strategy instead of continuing to play poor moves (exploration). – OccamsMan Apr 02 '14 at 10:56