Questions tagged [policy-gradient-descent]
44 questions
9 votes, 3 answers
What Loss Or Reward Is Backpropagated In Policy Gradients For Reinforcement Learning?
I have made a small script in Python to solve various Gym environments with policy gradients.
import gym, os
import numpy as np
#create environment
env = gym.make('CartPole-v0')
env.reset()
s_size = len(env.reset())
a_size = 2
#import my neural…

asked by S2673 (269)
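In vanilla policy gradients, what gets backpropagated is not the reward itself but the negative log-probability of the chosen action scaled by the return; the reward only enters as that scaling factor. A minimal PyTorch sketch under that assumption (the `policy_net`, `states`, `actions`, and `returns` names are illustrative, not from the question's script):

import torch
import torch.nn.functional as F

def reinforce_loss(policy_net, states, actions, returns):
    """Policy-gradient loss: -mean_t[ log pi(a_t|s_t) * G_t ]."""
    logits = policy_net(states)                    # shape (T, n_actions)
    log_probs = F.log_softmax(logits, dim=-1)      # log pi(.|s_t)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(taken * returns).mean()               # minimize = gradient ascent on return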
7 votes, 0 answers
Why does my agent always take the same action in DQN - Reinforcement Learning
I have trained an RL agent using the DQN algorithm. After 20,000 episodes my rewards converged. Now when I test this agent, it always takes the same action, irrespective of the state. I find this very weird. Can someone help me with this? Is…

asked by chink (1,505)
6 votes, 1 answer
PyTorch PPO implementation for CartPole-v0 getting stuck in local optima
I have implemented PPO for the CartPole-v0 environment. However, it does not converge in certain iterations of the game. Sometimes it gets stuck in local optima. I have implemented the algorithm using the TD(0) advantage, i.e.
A(s_t) = R_{t+1} + \gamma…

asked by 204 (433)
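The truncated expression in the excerpt is presumably the usual TD(0) advantage estimate, A(s_t) = R_{t+1} + gamma * V(s_{t+1}) - V(s_t). A minimal sketch of computing it, assuming a critic called `value_net` (an illustrative name) and tensors of rewards and done flags:

import torch

def td0_advantage(value_net, states, next_states, rewards, dones, gamma=0.99):
    """A(s_t) = r_{t+1} + gamma * V(s_{t+1}) * (1 - done) - V(s_t)."""
    with torch.no_grad():
        v_next = value_net(next_states).squeeze(-1)   # bootstrap target, no gradient
    v = value_net(states).squeeze(-1)
    return rewards + gamma * v_next * (1.0 - dones) - v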
4 votes, 1 answer
DDPG not converging for a simple control problem
I am trying to solve a control problem with DDPG. The problem is simple enough so that I can do value function iteration for its discretized version, and thus I have the "perfect" solution to compare my results with. But I want to solve the problem…

asked by Hypsoline (49)
3 votes, 0 answers
REINFORCE for CartPole: Training Unstable
I am implementing REINFORCE for CartPole-v0. However, the training process is very unstable. I have not implemented 'early stopping' for the environment and allow training to continue for a fixed (high) number of episodes. After a few thousand…

asked by 204 (433)
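One common way to stabilize REINFORCE on CartPole is to standardize the discounted returns before the policy update, which acts as a crude baseline. A sketch of that step (not necessarily what the asker's code does):

import numpy as np

def normalized_returns(rewards, gamma=0.99):
    """Discounted returns, standardized to zero mean and unit variance."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return (returns - returns.mean()) / (returns.std() + 1e-8)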
3 votes, 1 answer
Ray - RLlib - Error with Custom env - continuous action space - DDPG - offline experience training?
Error while using offline experiences for DDPG: the custom environment's dimensions (action space and state space) seem to be inconsistent with what the DDPG RLlib trainer expects.
Ubuntu, Ray 0.7 (latest Ray), DDPG example, offline dataset.…

asked by narasimha.m (61)
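Dimension errors of this kind usually trace back to the spaces declared by the custom environment not matching the recorded offline experiences. A minimal sketch of declaring continuous gym spaces (the shapes and bounds here are placeholders, not taken from the question):

import numpy as np
from gym import spaces

# The declared spaces must match the dimensions of the recorded experiences.
action_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(8,), dtype=np.float32)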
3 votes, 0 answers
Policy gradient in Keras predicts only one action
I am having trouble with the REINFORCE algorithm in Keras with Atari games. After about 30 episodes the network converges to one action. But the same algorithm works with CartPole-v1 and converges with a mean reward of 495.0 after about 350…

asked by tk338 (176)
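A frequent remedy when a softmax policy collapses onto a single action is adding an entropy bonus to the loss. A hedged PyTorch sketch of the idea (the question itself uses Keras, and the coefficient is illustrative):

import torch
import torch.nn.functional as F

def pg_loss_with_entropy(logits, actions, returns, entropy_coef=0.01):
    """Policy-gradient loss minus an entropy bonus that keeps the policy exploratory."""
    log_probs = F.log_softmax(logits, dim=-1)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return -(taken * returns).mean() - entropy_coef * entropy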
2 votes, 1 answer
Why is `ep_rew_mean` much larger than the reward evaluated by the `evaluate_policy()` function?
I wrote a custom gym environment and trained it with the PPO implementation provided by stable-baselines3. The ep_rew_mean recorded by TensorBoard is as follows:
[figure: the ep_rew_mean curve over 100 million total steps; each episode has 50 steps]
As shown in the figure, the…

asked by Aramiis (21)
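For context, ep_rew_mean is averaged over training rollouts collected with exploration noise, while evaluate_policy runs separate evaluation episodes. A small usage sketch with stable-baselines3 (CartPole stands in for the custom environment):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)

# deterministic=True disables the action sampling used during training rollouts
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)
print(f"evaluated reward: {mean_reward:.2f} +/- {std_reward:.2f}")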
2 votes, 2 answers
How to solve the zero probability problem in the policy gradient?
Recently, I tried to apply the naive policy gradient method to my problem. However, I found that the differences between the outputs of the last layer of the neural network are huge, which means that after applying the softmax layer, only…

asked by HZ-VUW (842)
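One standard way to avoid exact-zero probabilities is to work with logits and log-probabilities directly instead of taking the log of a softmax output. A PyTorch sketch:

import torch
from torch.distributions import Categorical

def sample_action(logits):
    """Sample from a categorical policy without materializing probabilities.

    Categorical(logits=...) uses a numerically stable log-softmax internally,
    so log pi(a|s) does not become -inf through underflow.
    """
    dist = Categorical(logits=logits)
    action = dist.sample()
    return action, dist.log_prob(action)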
2 votes, 1 answer
What are Target Networks in Policy Gradient algorithms in Reinforcement Learning, in simple terms, with some example?
How do they differ from a regular network?
Source Text --> "In DDPG algorithm topology consist of two copies of network weights for each network, (Actor: regular and target) and (Critic: regular and target)"

asked by keshav thosar (27)
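In DDPG the target networks are just slowly updated copies of the regular actor and critic weights, typically maintained by Polyak averaging. A minimal PyTorch sketch of that update:

import torch

@torch.no_grad()
def soft_update(regular_net, target_net, tau=0.005):
    """theta_target <- tau * theta_regular + (1 - tau) * theta_target."""
    for p, p_targ in zip(regular_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)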
2 votes, 1 answer
Can the output of the DDPG policy network be a probability distribution instead of a specific action value?
We know that DDPG is a deterministic policy gradient method and the output of its policy network should be a specific action. But once I tried to let the output of the policy network be a probability distribution over several actions, which means the…

asked by JinZ (21)
2 votes, 1 answer
How to accumulate my loss over mini-batches and then calculate my gradient
My main question is: is averaging the loss the same thing as averaging the gradient, and how do I accumulate my loss over mini-batches and then calculate my gradient?
I have been trying to implement policy gradient in TensorFlow and ran into the issue…

asked by Mike Jankowiak (29)
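Because the gradient is linear in the loss, averaging per-mini-batch losses and backpropagating once yields the same update as averaging per-mini-batch gradients. A PyTorch sketch of the accumulation pattern (the question itself is about TensorFlow; this only illustrates the idea):

import torch

def accumulated_step(model, optimizer, loss_fn, mini_batches):
    """Accumulate gradients over several mini-batches, then take one optimizer step."""
    optimizer.zero_grad()
    for batch in mini_batches:
        loss = loss_fn(model, batch) / len(mini_batches)  # average over mini-batches
        loss.backward()                                   # gradients sum across calls
    optimizer.step()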
2 votes, 1 answer
Reward function for Policy Gradient Descent in Reinforcement Learning
I'm currently learning about Policy Gradient Descent in the context of Reinforcement Learning. TL;DR, my question is: "What are the constraints on the reward function (in theory and practice) and what would be a good reward function for the case…

asked by Carsten (4,204)
1 vote, 0 answers
DDPG always choosing the boundary actions
I am trying to implement the DDPG algorithm, taking a state of 8 values and outputting an action of size 4.
The actions are lower bounded by [5, 5, 0, 0] and upper bounded by [40, 40, 15, 15].
When I train my DDPG it always chooses one of the boundaries, for example…

asked by Mohammad Bazzal (11)
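A common pattern for bounded actions is to have the actor output a tanh in [-1, 1] and rescale it to the environment's bounds, rather than clipping raw outputs (clipping lets the gradient keep pushing the actor further into the boundary). A sketch using the bounds from the question:

import numpy as np

LOW = np.array([5.0, 5.0, 0.0, 0.0])
HIGH = np.array([40.0, 40.0, 15.0, 15.0])

def scale_action(tanh_output):
    """Map an actor output in [-1, 1] to the bounds [LOW, HIGH]."""
    return LOW + (tanh_output + 1.0) * 0.5 * (HIGH - LOW)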
1 vote, 0 answers
How to sample actions from a multi-dimensional continuous action space for the REINFORCE algorithm
So, the problem that I am working on can be summarised like this:
The observation space is an 8x1 vector and all are continuous values. Some of them are in the range [-inf, inf] and some are [-360, 360].
The action space is a 4x1 vector. All the…

asked by Rizwan Malik (11)
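For a multi-dimensional continuous action space, REINFORCE is usually parameterized with an independent Gaussian per action dimension, summing the log-probabilities across dimensions. A PyTorch sketch, assuming the policy network outputs a mean and log standard deviation per dimension (names illustrative):

import torch
from torch.distributions import Normal

def sample_continuous_action(mean, log_std):
    """Sample a continuous action from independent Gaussians and return its joint log-prob."""
    dist = Normal(mean, log_std.exp())
    action = dist.sample()                      # e.g. shape (4,)
    log_prob = dist.log_prob(action).sum(-1)    # joint log-prob = sum over dimensions
    return action, log_prob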