
Hello, I'm working on total control of the cartpole problem (inverted pendulum). My aim is for the system to reach stability, meaning all the states (x, x_dot, theta and theta_dot) should converge to zero. I am using Q-learning with the update rule and reward function defined below.

```
Q_table[pre_s + (a,)] += alpha * (R + gamma * (argmax(Q_table[s])) - Q_table[pre_s + (a,)])
R = 1000*cos(theta) - 1000*(theta_dot**2) - 100*(x_dot**2) - 100*(x**2)
```

Unfortunately, there is no convergence. From the Q-table graph, I can see it increasing and stabilising at the maximum value, but the states just stay within a certain bound and do not go to zero. I feel like my agent is not learning fast enough and at some point it stops learning. Can anyone help me?

asked by Stevy KUIMI, edited by IvanH
  • Welcome to Stack Overflow! I edited the title of your question to be more readable. It is especially necessary to mark code samples properly, to prevent parts of them from being interpreted as formatting symbols. I also added some spaces and articles. – IvanH Nov 05 '18 at 19:58
  • Your reward is quite uncommon for this task. Maybe those large values (*1000) cause instability. Have a look [at the OpenAI Gym implementation's cost function](https://github.com/openai/gym/blob/master/gym/envs/classic_control/pendulum.py#L39), which is the most common for this task (a rough sketch of that cost is shown after these comments). Also, a lot depends on your learning rate `alpha` and your exploration strategy (I guess epsilon-greedy). – Simon Nov 08 '18 at 10:39
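
For reference, here is a minimal sketch of the kind of cost used at the linked line of Gym's pendulum environment; the constants are reproduced from memory and may differ slightly between versions:

```
import numpy as np

def angle_normalize(x):
    # Wrap the angle into [-pi, pi].
    return ((x + np.pi) % (2 * np.pi)) - np.pi

def pendulum_cost(theta, theta_dot, u):
    # Quadratic penalty on angle, angular velocity and control effort;
    # the environment's reward is the negative of this cost.
    return angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * (u ** 2)
```

Note that all terms stay on a similar scale and the reward is simply the negative cost, with no large constants like 1000.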

1 Answer


Assuming you are using an epsilon-greedy approach, your values for alpha and gamma could make a big difference. I suggest playing around with those values and seeing how that influences your agent.

Additionally, can you explain the logic behind your reward function? It seems unusual to multiply everything by 1000.
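
To make the tuning concrete, here is a minimal sketch of a tabular, epsilon-greedy Q-learning update with `alpha`, `gamma`, and `epsilon` exposed as hyperparameters. The state discretization is omitted, and the names and values below are illustrative assumptions, not taken from the question:

```
import numpy as np

# Illustrative hyperparameters -- tune these and observe the effect.
alpha = 0.1      # learning rate
gamma = 0.99     # discount factor
epsilon = 0.1    # exploration probability for epsilon-greedy
n_actions = 2    # e.g. push cart left / right

Q_table = {}     # maps a discretized state tuple to an array of action values

def get_q(s):
    # Lazily initialize unseen states with zero action values.
    if s not in Q_table:
        Q_table[s] = np.zeros(n_actions)
    return Q_table[s]

def choose_action(s):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(get_q(s)))

def q_update(pre_s, a, R, s):
    # Tabular Q-learning: the TD target uses the max action value of the
    # next state (argmax would give an index, not a value).
    td_target = R + gamma * np.max(get_q(s))
    get_q(pre_s)[a] += alpha * (td_target - get_q(pre_s)[a])
```

Sweeping `alpha`, `gamma`, and the epsilon decay schedule in a setup like this is a cheap way to see how sensitive the agent is to those values.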

answered by R.F. Nelson