ϵ-greedy policy
I know that a Q-learning algorithm should balance exploration against exploitation. Since I'm a beginner in this field, I wanted to implement a simple version of exploration/exploitation behavior.
Optimal epsilon value

My implementation uses the ϵ-greedy policy, but I'm at a loss when it comes to deciding the epsilon value. Should epsilon be bounded by the number of times the algorithm has visited a given (state, action) pair, or should it be bounded by the number of iterations performed?
My suggestions (a rough sketch of each follows after the list):

- Lower the epsilon value each time a given (state, action) pair has been encountered.
- Lower the epsilon value after a complete iteration has been performed.
- Lower the epsilon value each time we encounter a state s.
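For concreteness, here is a rough Python sketch of what I mean by these three schedules. The `1 / (1 + decay * n)` form and the `epsilon_0`/`decay` values are just assumptions I made up for illustration, not something I've tuned:

```python
import random
from collections import defaultdict

# Placeholder constants -- epsilon_0 and decay are assumptions, not tuned values.
epsilon_0 = 1.0   # initial exploration rate
decay = 0.01      # how quickly epsilon shrinks

sa_visits = defaultdict(int)  # option 1: visits per (state, action) pair
s_visits = defaultdict(int)   # option 3: visits per state
episode = 0                   # option 2: completed iterations

def epsilon_per_pair(state, action):
    # Option 1: epsilon shrinks with visits to this (state, action) pair.
    return epsilon_0 / (1 + decay * sa_visits[(state, action)])

def epsilon_per_episode():
    # Option 2: epsilon shrinks with the number of completed iterations.
    return epsilon_0 / (1 + decay * episode)

def epsilon_per_state(state):
    # Option 3: epsilon shrinks with visits to this state.
    return epsilon_0 / (1 + decay * s_visits[state])

def choose_action(Q, state, actions):
    # ε-greedy selection; here using the per-state schedule as an example.
    # The counters above would be incremented in the training loop.
    eps = epsilon_per_state(state)
    if random.random() < eps:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

In all three cases epsilon starts at `epsilon_0` and decays toward 0 as the relevant counter grows; the options differ only in which counter drives the decay.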
Much appreciated!