Hello, I'm working on full control of the cartpole problem (inverted pendulum). My aim is for the system to reach stability, meaning all the states (x, x_dot, theta and theta_dot) should converge to zero. I am using Q-learning with the update rule and reward function shown below.
Q_table[pre_s + (a,)] += alpha * (R + gamma *(argmax(Q_table[s])) - Q_table[pre_s + (a,)])
R=1000*cos(theta)-1000*(theta_dot**2)-100*(x_dot**2)-100*(x**2)
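
One thing worth checking in the update above: it bootstraps from argmax(Q_table[s]), which is the index of the best action, whereas the standard tabular Q-learning target uses the maximum Q-value of the next state. Here is a minimal sketch of that standard form for comparison. The function names, the default alpha and gamma, and the assumption that pre_s and s are tuples of discretised state indices are all illustrative and not taken from my actual code.

```python
import numpy as np

def reward(x, x_dot, theta, theta_dot):
    # Same reward as above, written as a function (hypothetical helper name).
    return 1000 * np.cos(theta) - 1000 * theta_dot**2 - 100 * x_dot**2 - 100 * x**2

def q_update(Q_table, pre_s, a, R, s, alpha=0.1, gamma=0.99):
    # pre_s and s are assumed to be tuples of discrete state indices,
    # so Q_table[s] is the 1-D array of action values in state s.
    # np.max returns the best value; np.argmax would return its index.
    td_target = R + gamma * np.max(Q_table[s])
    Q_table[pre_s + (a,)] += alpha * (td_target - Q_table[pre_s + (a,)])
    return Q_table
```

With only two actions, argmax can only ever return 0 or 1, so the bootstrap term would be almost negligible next to rewards on the order of 1000.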
Unfortunately, there is no convergence. From the graph of the Q-table values, I can see them increasing and stabilising at a maximum value, but the states just stay within a certain bound and do not go to zero. I feel like my agent is not learning fast enough, and at some point it is not learning anymore. Can anyone help me?