
I am using a model-based, single-agent reinforcement learning approach for autonomous flight.

In this project I used a simulator to collect training data (state, action, ending state) so that a Locally Weighted Linear Regression (LWLR) algorithm can learn the MODEL.
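For concreteness, here is a minimal sketch of how such a model could be queried, assuming the training inputs are concatenated (state, action) vectors and the targets are the ending states; the function name and the kernel bandwidth tau are illustrative, not part of the original project:

    import numpy as np

    def lwlr_predict(x_query, X_train, Y_train, tau=0.5):
        """Locally Weighted Linear Regression: predict the ending state for a
        (state, action) query by fitting a linear model weighted around it."""
        # Add a bias column to the inputs.
        Xb = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
        xq = np.hstack([1.0, x_query])
        # Gaussian kernel weights: nearby training samples count more.
        d2 = np.sum((X_train - x_query) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * tau ** 2))
        W = np.diag(w)
        # Weighted least squares: theta = (X^T W X)^-1 X^T W Y
        theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ Y_train
        return xq @ theta  # predicted ending state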

The STATE is defined by a vector, [Pitch, Yaw, Roll, Acceleration], describing the drone's attitude and motion in space. When it is given to the POLICY it has one additional feature, [WantedTrajectory].

The ACTION is also defined by a vector: [PowerOfMotor1, PowerOfMotor2, PowerOfMotor3, PowerOfMotor4]

The REWARD is calculated from the accuracy of the trajectory taken: given a starting spatial state, a wanted trajectory and an ending spatial state, the closer the trajectory actually taken is to the wanted one, the less negative the reward.
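A minimal sketch of a reward of that kind, assuming the wanted trajectory and the trajectory actually taken can be compared as displacement vectors (the exact distance measure here is just an assumption):

    import numpy as np

    def reward(start_state, wanted_trajectory, end_state):
        """Negative reward proportional to how far the trajectory actually
        taken is from the wanted one; closer means less negative."""
        actual_trajectory = end_state - start_state   # displacement actually taken
        error = np.linalg.norm(actual_trajectory - wanted_trajectory)
        return -error  # 0 when the wanted trajectory is matched exactly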

The algorithm for policy iteration is the following:

    start from a state S0
    loop
        1) select the best action according to the Policy
        2) use LWLR to find the ending state
        3) calculate the reward
        4) update the generalized V function
    endloop
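In Python-like form the loop could look as follows; this is only a sketch that reuses the lwlr_predict sketch from above, and it assumes S0, wanted_trajectory, the collected training data X_train / Y_train, a step budget num_steps and helpers named policy and update_value_function are already defined (all of these names are placeholders):

    import numpy as np

    state = S0
    for step in range(num_steps):
        # 1) best action according to the current policy
        action = policy(state, wanted_trajectory)
        # 2) the learned LWLR model predicts the ending state
        next_state = lwlr_predict(np.hstack([state, action]), X_train, Y_train)
        # 3) reward: how close the taken trajectory is to the wanted one
        r = reward(state, wanted_trajectory, next_state)
        # 4) update the generalized value function V towards the new estimate
        update_value_function(state, r, next_state)
        state = next_state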

This way the action taken also depends on the wanted trajectory (chosen by the user), the agent autonomously chooses the power of the 4 motors (trying to follow the wanted trajectory and obtain a bigger, i.e. less negative, reward), and the policy is dynamic since it depends on the value function, which keeps being updated.

The only problem is that defining the POLICY as follows (with S = [Pitch, Yaw, Roll, Acceleration, WantedTrajectory]):

π(S) = argmax_a ( V( LWLR(S,a) ) )

(i.e., out of all possible actions, pick the one that from the current state leads the agent to the state with the biggest value) is very expensive computationally, since the action space is very large.
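To make the cost concrete: a brute-force version of this policy has to run the model and the value function once per candidate action, so even a coarse discretization of each motor power into K levels means K^4 evaluations per decision. A sketch with illustrative names, where model is the learned LWLR transition model and V the generalized value function:

    import itertools
    import numpy as np

    def greedy_policy(state, V, model, power_levels):
        """pi(S) = argmax_a V(model(S, a)) over a discretized action grid."""
        best_action, best_value = None, -np.inf
        # K power levels per motor -> K**4 candidate actions to evaluate
        for action in itertools.product(power_levels, repeat=4):
            action = np.array(action)
            next_state = model(state, action)   # LWLR prediction of the ending state
            value = V(next_state)               # generalized value function
            if value > best_value:
                best_action, best_value = action, value
        return best_action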

Is there a way to generalize a POLICY that depends on an already generalized VALUE FUNCTION?

DaddaBarba
  • Why don't you use action discretization? And function approximation for the states would be nice. – NKN Sep 18 '15 at 18:15

1 Answer


I think that actor-critic methods using policy gradient will be useful to you.

In that case, you use a parametrized policy which is adjusted based on an objective function built on your value function. There are further improvements, such as using advantage functions, etc.
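As a rough illustration only (the linear features, Gaussian exploration noise and learning rates below are assumptions, not a recommendation): the actor is a parametrized policy that outputs the four motor powers directly, the critic is your value function, and the actor's parameters are nudged in the direction the critic says is better, so no argmax over the action space is needed at decision time.

    import numpy as np

    # Minimal actor-critic sketch: a linear Gaussian actor over the 4 motor
    # powers and a linear critic over simple state features (illustrative only).
    STATE_DIM = 5                          # e.g. [pitch, yaw, roll, acceleration, wanted_trajectory]
    theta = np.zeros((STATE_DIM + 1, 4))   # actor weights: features + bias -> 4 motor powers
    w = np.zeros(STATE_DIM + 1)            # critic weights
    sigma, alpha_actor, alpha_critic, gamma = 0.1, 1e-3, 1e-2, 0.99

    def features(state):
        return np.hstack([state, 1.0])        # raw state features plus a bias term

    def act(state):
        mean = features(state) @ theta        # actor's mean action
        return np.random.normal(mean, sigma)  # Gaussian exploration around it

    def update(state, action, r, next_state):
        global theta, w
        phi, phi_next = features(state), features(next_state)
        # Critic: TD(0) update of the value function estimate
        td_error = r + gamma * (phi_next @ w) - phi @ w
        w += alpha_critic * td_error * phi
        # Actor: policy-gradient step, using the TD error as the advantage
        grad_log_pi = np.outer(phi, (action - phi @ theta)) / sigma ** 2
        theta += alpha_actor * td_error * grad_log_pi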

David Silver made a nice video that you may find useful:

https://www.youtube.com/watch?v=KHZVXao4qXs&index=7&list=PL5X3mDkKaJrL42i_jhE4N-p6E2Ol62Ofa

Juan Leni