Perhaps this is a question better posed to Computer Science or Cross Validated?
I'm beginning some work with LSTMs on sequences of arbitrary length, and one problem I'm experiencing that I haven't seen addressed is that my network seems to have developed a couple of parameters that grow linearly over time (perhaps acting as a measure of time?).
The obvious issue with this is that the training data is bounded at sequences of length x, so the network grows this parameter reasonably up until timestep x. But after that point, the network will eventually produce NaNs because the values become too extreme.
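Roughly how I'm observing this (a minimal sketch in PyTorch; the dimensions and inputs are placeholders, not my actual model):

```python
import torch

# Placeholder dimensions, just to illustrate the symptom.
input_size, hidden_size, steps = 8, 32, 500

cell = torch.nn.LSTMCell(input_size, hidden_size)
h = torch.zeros(1, hidden_size)
c = torch.zeros(1, hidden_size)

with torch.no_grad():
    for t in range(steps):
        x_t = torch.randn(1, input_size)  # stand-in for a real input
        h, c = cell(x_t, (h, c))
        if t % 100 == 0:
            # In my network, a few components of the cell state keep
            # growing with t, and on long rollouts they overflow to NaN.
            print(t, c.abs().max().item())
```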
Has anyone read anything about normalizing or stabilizing states over time?
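One workaround I've considered (I'm not sure whether it's principled) is clamping the cell state between steps so no component can grow without bound. Something along these lines, where the bound max_abs is an arbitrary value I made up:

```python
import torch

def step_with_clamp(cell, x_t, h, c, max_abs=10.0):
    # Hypothetical workaround: hard-clip the cell state after each step
    # so its components stay bounded. max_abs is an arbitrary choice.
    h, c = cell(x_t, (h, c))
    c = c.clamp(-max_abs, max_abs)
    return h, c
```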
Any suggestions would be much appreciated.