Recently, I have been trying to apply the vanilla (naive) policy gradient method to my problem. However, I found that the differences between the outputs of the network's last layer are huge, which means that after the softmax layer one action gets a probability of essentially 1 while every other action gets essentially 0. For instance, the output of the last layer is shown below:
[ 242.9629, -115.6593, 63.3984, 226.1815, 131.5903, -316.6087,
-205.9341, 98.7216, 136.7644, 266.8708, 19.2289, 47.7531]
After applying the softmax function, the distribution is effectively one-hot, so only one action can ever be chosen:
[4.1395e-11, 0.0000e+00, 0.0000e+00, 2.1323e-18, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00]
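For reference, the collapse can be reproduced directly from the logits above (a minimal PyTorch sketch; the tensor values are simply copied from the printout):

```python
import torch

# Logits produced by the last layer (copied from the run above)
logits = torch.tensor([242.9629, -115.6593, 63.3984, 226.1815, 131.5903, -316.6087,
                       -205.9341, 98.7216, 136.7644, 266.8708, 19.2289, 47.7531])

probs = torch.softmax(logits, dim=0)
print(probs)
# The largest logit (266.87) dominates: exp(242.96 - 266.87) ~ 4e-11 and every
# other entry underflows to 0, so the distribution is effectively one-hot.

dist = torch.distributions.Categorical(probs=probs)
print(dist.entropy())  # ~0, i.e. the policy has collapsed onto a single action
print(dist.sample())   # returns index 9 essentially every time
```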
This severely hurts the final performance, because the policy collapses to a single constant action after only a few training steps. Is there any way to solve this problem?
(By the way, even when I give negative rewards for that action, the policy still keeps choosing it.)