
Recently, I have tried to apply the naive policy gradient method to my problem. However, I found that the differences between the outputs (logits) of the last layer of the neural network are huge, which means that after applying the softmax layer, one action is assigned a probability of 1 and all other actions a probability of 0. For instance, the output of the last layer is shown below:

[ 242.9629, -115.6593,   63.3984,  226.1815,  131.5903, -316.6087,
 -205.9341,   98.7216,  136.7644,  266.8708,   19.2289,   47.7531]

After applying the softmax function, it is clear that only one action will be chosen.

[4.1395e-11, 0.0000e+00, 0.0000e+00, 2.1323e-18, 0.0000e+00, 0.0000e+00,
 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00]
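
For reference, the collapse is easy to reproduce with a few lines (this is only a sketch assuming a PyTorch implementation, as in my setup, using the logits listed above):

    import torch

    # Logits taken from the output printed above; values this far apart
    # saturate the softmax into a (numerically) one-hot distribution.
    logits = torch.tensor([242.9629, -115.6593, 63.3984, 226.1815, 131.5903, -316.6087,
                           -205.9341, 98.7216, 136.7644, 266.8708, 19.2289, 47.7531])

    probs = torch.softmax(logits, dim=-1)
    print(probs)  # essentially one-hot: index 9 gets probability 1.0

    # The policy entropy is numerically zero, i.e. no exploration is left.
    print(torch.distributions.Categorical(probs=probs).entropy())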

This severely affects the final performance, as the neural network settles on a single constant action after only a few steps. Is there any way to solve this problem?

(By the way, even when I give negative rewards to the neural network, the actions it chooses remain unchanged.)

My training curve is shown below: [training curve figure]

HZ-VUW

2 Answers


In fact, there is no deterministic way to solve this, as it is an instance of the age-old exploration-exploitation dilemma from the optimization domain. That said, in reinforcement learning there are two simple ways to mitigate it:

  1. Firstly, reducing the learning rate is the simplest option. With a lower learning rate, the policy network explores more actions and thus avoids getting stuck in a local optimum.
  2. Secondly, adding a policy entropy term to the loss function is another option (a sketch is given below). A good example of this idea is the soft actor-critic (SAC) algorithm.

Both methods have been validated on my task, and both effectively alleviate the premature-convergence problem. However, each introduces a hyperparameter (the learning rate and the entropy coefficient, respectively) that needs to be tuned by hand, which increases the complexity of my algorithm.
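
To make the entropy idea concrete, here is a minimal sketch of what the loss could look like in a naive policy-gradient (REINFORCE-style) update. It is written for PyTorch; policy_net, states, actions, returns, and entropy_coef are placeholder names, not my actual code:

    import torch

    def pg_loss_with_entropy(policy_net, states, actions, returns, entropy_coef=0.01):
        # REINFORCE-style loss with an entropy bonus that discourages the
        # policy from collapsing onto a single action.
        logits = policy_net(states)                       # (batch, n_actions)
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)                # (batch,)
        pg_term = -(log_probs * returns).mean()           # vanilla policy gradient
        entropy_term = dist.entropy().mean()              # high entropy = more exploration
        return pg_term - entropy_coef * entropy_term      # minus sign: reward high entropy

Minimizing this loss pushes the entropy up (hence the minus sign); setting entropy_coef to zero recovers the naive policy gradient, and larger values keep the action distribution flatter for longer.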

By the way, similar to Q-learning, we can also use an epsilon-greedy mechanism to encourage the agent to explore more actions. However, this is not an elegant solution, because it is hard to determine a good epsilon value.
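
For completeness, an epsilon-greedy wrapper around the softmax policy could look like the following sketch (again PyTorch; select_action, policy_net, and epsilon are placeholder names, and epsilon is the value that is hard to choose):

    import torch

    def select_action(policy_net, state, epsilon=0.1):
        # With probability epsilon take a uniformly random action,
        # otherwise sample from the softmax policy as usual.
        logits = policy_net(state)                        # (n_actions,)
        if torch.rand(()) < epsilon:
            return torch.randint(logits.shape[-1], ())    # uniform random action
        return torch.distributions.Categorical(logits=logits).sample()

Note that the random actions are not drawn from the current policy, which conflicts with the on-policy nature of the naive policy gradient unless the update is corrected accordingly; this is another reason the approach is not elegant here.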

HZ-VUW
  1. As far as I know, PG is usually adopted to handle continuous actions. You may need to try value-based algorithms.
  2. Is the softmax implemented correctly? Pasting your code here, or some metrics of the learning process, may help.
Jarvis
  • 1. It is very common to use PG in discrete action spaces. 2. I am almost certain that my implementation is correct; I simply applied "torch.softmax" to the output of the last linear layer. – HZ-VUW Nov 03 '20 at 13:01
  • I am not sure what your questions are. With the softmax, the action with the largest logit is very likely to be chosen. There are some alternatives to the softmax that address this, but I forget which paper mentions them. – Jarvis Nov 03 '20 at 13:45
  • The description just claims your PG doesn't work, but I cannot tell what makes it fail. Lack of exploration? Sparse reward? Did the learning process go wrong? You should at least provide your curves. – Jarvis Nov 03 '20 at 13:50
  • I have added the training curve to my question description. It is easy to see that the above-mentioned problem leads to a lack of exploration; however, I have no idea how to fix it. – HZ-VUW Nov 03 '20 at 14:06
  • PG is an on-policy method; it can only use the current policy's data. For continuous actions, sampling from a Gaussian or adding OU noise are ways to introduce exploration. If you need a more stochastic policy, adding an entropy term as in SAC may be helpful. Or maybe naive PG is just not good enough for this problem (sparse reward? long horizon?). – Jarvis Nov 03 '20 at 15:26
  • SAC sounds like a reasonable way to avoid premature convergence in reinforcement learning, even though I am not sure whether it will work in my case. Moreover, I feel that a simple policy gradient algorithm should be able to solve my problem (reinforcement-learning-based AutoML, as in "Efficient Neural Architecture Search via Parameters Sharing", International Conference on Machine Learning, 2018). Anyway, I will try to use regularization in my algorithm and see if there are any improvements. – HZ-VUW Nov 04 '20 at 06:19