To make epsilon dynamic, I came up with the following formula for my application. It takes both tries and rewards into account: the more tries and rewards accumulate, the smaller epsilon becomes. You can tune it with the `ExploreRate` parameter.

`Tries`
: Total number of tries across all strategies (or all machines), i.e. the step count.

`Rewards`
: Total number of rewards, i.e. the total count of successes.
Some examples:

- `ExploreRate = 1000`, `Tries = 10`, `Rewards = 9` -> `Epsilon = 1`. Tries are still too few, so we keep exploring fully.
- `ExploreRate = 1000`, `Tries = 100`, `Rewards = 90` -> `Epsilon = 0.11`. Small, because the many conversions are a sign that the estimates are trustworthy.
- `ExploreRate = 1000`, `Tries = 100`, `Rewards = 9` -> `Epsilon = 1`. Although we had many tries, the low reward count makes the result unreliable, so we continue exploring.
- `ExploreRate = 100`, `Tries = 100`, `Rewards = 9` -> `Epsilon = 0.11`. The lower the `ExploreRate`, the faster the convergence.
Increasing `ExploreRate` makes the policy explore more and converge more slowly; decreasing it makes it converge faster.
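The text does not spell out the formula, but `Epsilon = min(1, ExploreRate / (Tries * Rewards))` reproduces all four examples above, so here is a minimal Python sketch under that assumption (the function name `dynamic_epsilon` is mine, not from the original):

```python
def dynamic_epsilon(explore_rate: float, tries: int, rewards: int) -> float:
    """Return an epsilon in [0, 1] that shrinks as tries and rewards grow.

    Reconstruction of the formula implied by the examples:
    epsilon = min(1, explore_rate / (tries * rewards)).
    """
    if tries == 0 or rewards == 0:
        return 1.0  # no evidence yet: explore fully
    return min(1.0, explore_rate / (tries * rewards))

# Reproduces the examples from the text:
print(dynamic_epsilon(1000, 10, 9))    # 1.0
print(dynamic_epsilon(1000, 100, 90))  # ~0.11
print(dynamic_epsilon(1000, 100, 9))   # 1.0
print(dynamic_epsilon(100, 100, 9))    # ~0.11
```

Note the guard for zero tries or zero rewards: with no successes recorded yet, the ratio is undefined, and returning 1 keeps the policy fully exploratory, which is consistent with the "too few tries" case above.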
You can alternatively use a learning rate instead of an explore rate:
