I'm trying to build an agent that can play Pocket Tanks using RL. The problem I'm facing is how to train a neural network to output the correct power and angle, so instead of action classification I want regression.
-
[`deep Q-learning`](https://skymind.ai/wiki/deep-reinforcement-learning) – modesitt Aug 09 '18 at 19:19
-
Q-learning won't help because it outputs the Q-value for each action, but I want a power and an angle, not an action! – NotMoftah Aug 09 '18 at 19:44
-
Possible duplicate of [Generalizing Q-learning to work with a continuous \*action\* space](https://stackoverflow.com/questions/7098625/generalizing-q-learning-to-work-with-a-continuous-action-space) – maxy Aug 09 '18 at 21:22
1 Answer
In order to output the correct power and angle, all you need to do is change the activation of the last layer of your neural network.
In your question, you stated that you are currently using an action classification output, so it is most likely a softmax output layer. We can do two things here:
If the power and angle have hard constraints, e.g. the angle cannot exceed 360° or the power cannot exceed 700 kW, we can change the softmax output to a tanh output (hyperbolic tangent) and multiply it by the power/angle constraint. This creates a "scaling effect": tanh's output lies between -1 and 1, so multiplying it by the maximum power/angle guarantees that the constraints are always satisfied and the output is a valid power/angle.
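For example, here is a minimal sketch of such a bounded output head in PyTorch (the layer sizes, the `state_dim`, and the exact bounds are just assumptions for illustration, not something from your game):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a game state to an (angle, power) pair bounded by hard constraints."""

    def __init__(self, state_dim=10, max_angle=360.0, max_power=700.0):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 2),  # two outputs: angle and power
        )
        # Per-output scaling factors applied after tanh.
        self.register_buffer("scale", torch.tensor([max_angle, max_power]))

    def forward(self, state):
        # tanh squashes each output to (-1, 1); multiplying by the bounds
        # keeps angle and power inside their allowed ranges.
        return torch.tanh(self.body(state)) * self.scale
```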
If there are no constraints on your problem, we can simply delete the softmax output altogether. Removing the softmax means the output is no longer squashed between 0 and 1; the last layer of the neural network then simply acts as a linear mapping, i.e., y = Wx + b.
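A corresponding unconstrained head is just the same network with no activation on the last layer, again only a sketch with assumed sizes:

```python
import torch.nn as nn

# Unconstrained regression head: no activation on the last layer, so the
# outputs can take any real value (the last layer is just y = Wx + b).
unconstrained_net = nn.Sequential(
    nn.Linear(10, 64),  # 10 = assumed state dimension
    nn.ReLU(),
    nn.Linear(64, 2),   # raw (angle, power) estimates
)
```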
I hope this helps!
EDIT: In both cases, the loss function used to train your neural network can simply be an MSE loss. Example: loss = (real_power - estimated_power)^2 + (real_angle - estimated_angle)^2
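As a sketch of that loss in PyTorch, using the unconstrained head from above (the batch and the `real_*` targets here are placeholders; in practice they come from whatever training scheme you use):

```python
import torch
import torch.nn.functional as F

# Placeholder batch; in practice `states` comes from the game and the
# real_* targets from whatever supervision or RL scheme you use.
states = torch.randn(32, 10)            # 32 samples, assumed state dim of 10
real_angle = torch.rand(32) * 360.0
real_power = torch.rand(32) * 700.0

pred = unconstrained_net(states)        # shape (32, 2): angle, power
target = torch.stack([real_angle, real_power], dim=1)
loss = F.mse_loss(pred, target)         # mean squared error over both outputs
loss.backward()
```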

-
That's what I did. The problem here is that I don't know the estimated_power, so I want to train the network with RL. I don't have any activation on the last layer. All I meant was that I know plain RL outputs binary actions (take this or that), but I want it to output (do this using this value). – NotMoftah Aug 15 '18 at 19:56
-
Hello Max, I am a little confused about how you are getting binary actions if there is no activation in the last layer. To have binary outputs, the activation of the output layer must be either a sigmoid function (if there is only one output) or a softmax function (if you have multiple outputs). The idea is that the output gets squashed between 0 and 1, so 1 means do the action and 0 means don't do it. If you have no activation in the last layer, the output is unbounded (i.e., -inf to inf), so it is a regression and not a classification. – Rui Nian Aug 15 '18 at 20:56
-
I'd like to start my comment by saying that my English is a bit poor, which is why I couldn't express my problem well. Look, RL basically outputs a Q-value for all the actions, and we usually pick the action with the highest Q-value. That's easy, I've done it a lot, but I want some sort of RL where the output is not the expected reward; instead I want a specific angle and power to fire my tank. I hope you get my idea. – NotMoftah Aug 16 '18 at 02:42
-
Hello Max. I think there may be some misconception regarding the output of reinforcement learning. RL will only output the Q-value if the algorithm you are using is Q-learning. The Q-value describes the "goodness" of performing a certain action in a given state. Traditionally, RL is done through policy iteration and value iteration; the Q-value you're talking about is a value-iteration approach. If you want to output the power/angle directly, you need a policy-iteration approach, where the RL agent takes in a state and outputs an action. – Rui Nian Aug 16 '18 at 15:51
-
To do so, it is actually much simpler than what you have: you just remove the Q-learning equation from your code and map each action to a reward directly. – Rui Nian Aug 16 '18 at 15:53