I had a NaN issue that prevented my model from running for even one iteration, and the batch normalization solution in this post let it run. However, some iterations still return NaNs/Infs, and they go away after a few iterations. Is this normal?
I also noticed that the number of LSTM units affects this behavior. Can anyone explain the proper way to combine batch normalization with the choice of the number of units in the LSTM layers?
My model's structure is similar to the one in this post, and I'm just wondering where in the model I should apply batch normalization. Is the batch normalization correctly implemented as-is, or do I need to add it after each LSTM layer?
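For reference, here is a minimal sketch of the placement I'm asking about, with `BatchNormalization` inserted after each LSTM layer. The layer sizes, input shape, and dummy data are placeholders for illustration, not my actual model:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, BatchNormalization, Dense

# Placeholder shapes: 30 timesteps, 8 features per timestep.
model = Sequential([
    # First LSTM returns the full sequence so the next LSTM can consume it.
    LSTM(64, return_sequences=True, input_shape=(30, 8)),
    # Is normalizing here, between the recurrent layers, the right place?
    BatchNormalization(),
    LSTM(32),
    # ...and again after the second LSTM layer?
    BatchNormalization(),
    Dense(1),
])

model.compile(optimizer="adam", loss="mse")

# Dummy data just to confirm the model trains for one epoch.
x = np.random.rand(100, 30, 8).astype("float32")
y = np.random.rand(100, 1).astype("float32")
model.fit(x, y, epochs=1, batch_size=16, verbose=0)
```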