To make epsilon dynamic, I came up with the following formula for my application. It takes both tries and rewards into account: the more tries and rewards accumulate, the smaller epsilon becomes. You can tune it with the `ExploreRate` parameter.

`Tries`
: Total number of tries across all strategies (or all machines), i.e. the step count.

`Rewards`
: Total number of rewards, i.e. the total count of successes.
Some examples:

- `ExploreRate = 1000`, `Tries = 10`, `Rewards = 9` -> `Epsilon = 1`. Tries are still too few, so we keep exploring fully.
- `ExploreRate = 1000`, `Tries = 100`, `Rewards = 90` -> `Epsilon = 0.11`. Small, because the many conversions are a sign that the estimates are trustworthy.
- `ExploreRate = 1000`, `Tries = 100`, `Rewards = 9` -> `Epsilon = 1`. Although we had many tries, the low reward count makes the result unreliable, so we continue exploring.
- `ExploreRate = 100`, `Tries = 100`, `Rewards = 9` -> `Epsilon = 0.11`. The lower the `ExploreRate`, the faster the convergence.
Increasing `ExploreRate` makes the policy explore more and converge more slowly; decreasing it makes it converge faster.
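The text does not spell out the formula, but `Epsilon = min(1, ExploreRate / (Tries * Rewards))` reproduces all four examples above, so here is a minimal Python sketch under that assumption (the function name `dynamic_epsilon` is mine, not from the original):

```python
def dynamic_epsilon(explore_rate: float, tries: int, rewards: int) -> float:
    """Return an epsilon in [0, 1] that shrinks as tries and rewards grow.

    Reconstruction of the formula implied by the examples:
    epsilon = min(1, explore_rate / (tries * rewards)).
    """
    if tries == 0 or rewards == 0:
        return 1.0  # no evidence yet: explore fully
    return min(1.0, explore_rate / (tries * rewards))

# Reproduces the examples from the text:
print(dynamic_epsilon(1000, 10, 9))    # 1.0
print(dynamic_epsilon(1000, 100, 90))  # ~0.11
print(dynamic_epsilon(1000, 100, 9))   # 1.0
print(dynamic_epsilon(100, 100, 9))    # ~0.11
```

Note the guard for zero tries or zero rewards: with no successes recorded yet, the ratio is undefined, and returning 1 keeps the policy fully exploratory, which is consistent with the "too few tries" case above.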
You can alternatively use a learning rate instead of an explore rate:
