
My SARSA with gradient descent keeps escalating the weights exponentially. At episode 4, step 17, the value is already NaN:

Exception: Qa is nan

For example:

6) Qa:
Qa = -2.00890180632e+303

7) NEXT Qa:
Next Qa with west = -2.28577776413e+303

8) THETA:
1.78032402991e+303 <= -0.1 + (0.1 * -2.28577776413e+303) - -2.00890180632e+303

9) WEIGHTS (sample):
5.18266630725e+302 <= -1.58305782482e+301 + (0.3 * 1.78032402991e+303 * 1)

I don't know where to look for the mistake I made. Here's some code FWIW:

def getTheta(self, reward, Qa, QaNext):
    """ theta = r + gamma * Qw(s',a') - Qw(s,a) """
    theta = reward + (self.gamma * QaNext) - Qa
    return theta


def updateWeights(self, Fsa, theta):
    """ wi <- wi + alpha * theta * Fi(s,a) """
    for i, w in enumerate(self.weights):
        self.weights[i] += (self.alpha * theta * Fsa[i])

I have about 183 binary features.
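For context, Qa and QaNext come from the usual linear approximation, i.e. the dot product of the weights with the binary feature vector. Roughly something like this (simplified, not my exact method):

def getQ(self, Fsa):
    """ Qw(s,a) = sum_i wi * Fi(s,a) over the ~183 binary features """
    return sum(w * f for w, f in zip(self.weights, Fsa))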

Tjorriemorrie
  • An answer is hardly possible given the provided info. I would try reducing alpha/theta, and look in detail at the quantities involved. – davidhigh May 21 '14 at 15:23
  • Are you doing the normalization step, or just adding to the weights? – NKN Aug 13 '14 at 09:54
  • @NKN thanks, your normalization step helps. Still new to this, I wish there was more documentation on that. – Tjorriemorrie Aug 14 '14 at 05:43

2 Answers


You need normalization in each trial. This will keep the weights in a bounded range (e.g. [0, 1]). The way you are adding to the weights each time just grows them, and they become useless after the first trial.

I would do something like this:

self.weights[i] += (self.alpha * theta * Fsa[i])
normalize(self.weights, wmin, wmax)

Or see the following example (from the RL literature):

[image: example from the RL literature]

You need to write the normalization function by yourself though ;)
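A minimal sketch of one possible normalize, min-max scaling the whole weight vector into [wmin, wmax] (adapt it to your own code):

def normalize(weights, wmin=0.0, wmax=1.0):
    """ min-max scale the weight vector into [wmin, wmax] in place """
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return  # all weights equal, nothing to rescale
    for i, w in enumerate(weights):
        weights[i] = wmin + (w - lo) * (wmax - wmin) / (hi - lo)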

NKN
  • Could you please perhaps give the source of the literature? I would like to read more about it. – Tjorriemorrie Oct 21 '14 at 10:42
  • I would suggest this book: http://books.google.it/books?hl=en&lr=&id=UGUqcl8_T9QC&oi=fnd&pg=PP1&dq=reinforcement+learning+linear+function+approximation+lucian&ots=Xk47TPU8Ww&sig=88QfOYsStxB4gT1BByZqd5h97sQ&redir_esc=y#v=onepage&q=reinforcement%20learning%20linear%20function%20approximation%20lucian&f=false – NKN Oct 21 '14 at 11:37

I do not have access to the full code of your application, so I might be wrong, but I think I know where you are going wrong. First and foremost, normalization should not be necessary here. For the weights to blow up this soon suggests something is wrong with your implementation.

I think your update equation should be:

self.weights[:, action_i] = self.weights[:, action_i] + (self.alpha * theta * Fsa)

That is to say, you should be updating columns instead of rows, because rows are for states and columns are for actions in the weight matrix.
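A sketch of what that would look like, assuming self.weights is a NumPy array with one row per feature/state and one column per action, and Fsa is the binary feature vector for the current state:

import numpy as np

def updateWeights(self, Fsa, theta, action_i):
    """ w[:, a] <- w[:, a] + alpha * theta * F(s,a), updating only the taken action's column """
    # shapes assumed: self.weights (n_features, n_actions), Fsa (n_features,)
    self.weights[:, action_i] += self.alpha * theta * np.asarray(Fsa)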