
Intro

I am building a classifier to recognize the presence of defects in pictures, and while improving my models I tried Batch Normalization, mainly to exploit its ability to speed up convergence.

While it gives the expected speed benefits, I also observed some strange symptoms:

  • validation metrics are far from good; of course this smells of overfitting
  • predictions calculated at any point during training are completely wrong, particularly when images are picked from the training dataset; the corresponding metrics match (val_loss, val_acc) rather than the (loss, acc) printed during training

This failure to predict is the evidence that worries me the most: a model that does not predict the same way it did during training is useless!
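To make this symptom concrete, the check below is roughly what I run; `model`, `x_train` and `y_train` are placeholders for my actual classifier and data, and I assume a single sigmoid output for simplicity:

```python
import numpy as np

# Train as usual and keep the history printed by fit()
history = model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=1)

# Recompute accuracy from fresh predictions on the *same* training data
probs = model.predict(x_train, batch_size=32)
pred_labels = (probs > 0.5).astype(int).ravel()
acc_from_predict = np.mean(pred_labels == y_train.ravel())

print("acc printed by fit():", history.history['acc'][-1])
print("acc from predict():  ", acc_from_predict)
# With BN in the model, the second number tracks val_acc, not acc.
```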

Searches

Googling around, I found some posts that seem to be related, particularly this one (Keras BN layer is broken), which also claims the existence of a patch and of a pull request that sadly "was rejected".

This is quite convincing, in that it explains a failure mechanism that matches my observations. As far as I understand, BN calculates and keeps moving statistics (exponential moving averages of the mean and standard deviation) to do its job, and these require many iterations to stabilize and become meaningful; so of course it will behave badly when asked to make a prediction from scratch, while those statistics are not yet mature (if I have misunderstood this concept, please tell me).
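The tiny sketch below is purely illustrative of my understanding (it is not my real model): the moving statistics start at mean = 0 and variance = 1 and, with the default momentum of 0.99, move only slowly towards the true statistics of the data:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import BatchNormalization

model = Sequential([BatchNormalization(input_shape=(4,))])
model.compile(optimizer='sgd', loss='mse')

bn = model.layers[0]
# weights order: gamma, beta, moving_mean, moving_variance
print("initial moving stats:", bn.get_weights()[2], bn.get_weights()[3])

# Feed data with mean 5 and variance 4; the target is irrelevant here,
# the moving statistics are updated during the forward pass anyway.
x = np.random.normal(loc=5.0, scale=2.0, size=(32, 4)).astype('float32')
for _ in range(10):
    model.train_on_batch(x, x)

# After only a few batches the moving averages are still far from the
# true mean/variance, because they are exponential averages (momentum=0.99).
print("after 10 batches:    ", bn.get_weights()[2], bn.get_weights()[3])
```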

Actual Questions

But thinking about it more thoroughly, this doesn't really settle the issue, and it actually raises further doubts. I am still perplexed by the following:

  1. This Keras BN breakage is said to affect the transfer-learning use case, while mine is a classical convolutional classifier trained from scratch with standard Glorot initialization. If it really affected this case, thousands of users should have complained about it, while instead there isn't much discussion around.
  2. technically: if my understanding is correct, why aren't these statistics (since they are so fundamental for prediction) saved in the model, so that their latest update is available when making a prediction? It seems perfectly feasible to keep and use them at prediction time, just like any trainable parameter (see the sketch after this list for how I tried to check this)
  3. management-wise: if Keras' BN were really broken, how could such a dreadful bug remain unaddressed for more than a year? Is there really nobody out there using BN and needing predictions out of their models? And nobody able to fix it?
  4. more practically: on the contrary, if it is not a bug but just a misunderstanding of how to use it, where can I find a clear illustration of "how to correctly get a prediction in Keras from a model which uses BN"? (demo code would be appreciated)
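For point 2, this is how I tried to check whether the moving statistics are actually stored in the saved model at all (a minimal sketch, not my real model; the file name is a placeholder):

```python
from keras.models import Sequential, load_model
from keras.layers import BatchNormalization

model = Sequential([BatchNormalization(input_shape=(4,))])
model.compile(optimizer='sgd', loss='mse')
model.save('bn_check.h5')

restored = load_model('bn_check.h5')
# weights order: gamma, beta, moving_mean, moving_variance
for w_old, w_new in zip(model.layers[0].get_weights(),
                        restored.layers[0].get_weights()):
    print(w_old.shape, (w_old == w_new).all())
```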

Obviously I would really love the last question to be the right one, but I had to include the previous ones, given the evidence of someone claiming that Keras BN is broken.

Note to SE moderators: before *closing the question as too broad*, please consider that, since it is not really clear what the issue is (Keras BN being broken, or users being unable to use it properly), I had to offer several directions, among which whoever wishes to answer can choose.

Details

  • I am using keras 2.2.4 from a python 3.6 virtual environment (under pyenv/virtualenv).
  • data are fed through a classic ImageDataGenerator() + flow_from_directory() / flow_from_dataframe() scheme (augmentation is turned off though: only rescale=1./255 is applied), but I also tried feeding them as static arrays
  • actually, in the end, to verify the behaviour above I generated a single batch x, y = next(valid_generator) and used that same batch for both training and validation (see the sketch after this list). On the training side it converges (yes, the aim was exactly to let it overfit!), but on the validation side both metrics are poor and predictions are completely wrong and erratic (almost random)
  • in this setup, if BN is turned off, val_loss and val_acc match exactly with loss and acc, and with the metrics I obtain from predictions calculated after training has finished.
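For reference, the single-batch setup described above looks roughly like this (directory, image size and batch size are placeholders; `model` stands for my defect classifier):

```python
from keras.preprocessing.image import ImageDataGenerator

# No augmentation, only rescaling
datagen = ImageDataGenerator(rescale=1. / 255)
valid_generator = datagen.flow_from_directory(
    'data/valid', target_size=(224, 224), batch_size=32, class_mode='binary')

# Freeze a single batch and reuse it for both training and validation
x, y = next(valid_generator)
model.fit(x, y, epochs=50, batch_size=32, validation_data=(x, y))

# With BN layers: loss/acc converge (it overfits, as intended), but
# val_loss/val_acc stay poor even though it is literally the same data.
# Without BN: the two pairs of metrics match, as expected.
```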

Update

While writing a minimal example of the issue, and after struggling to make the problem visible, I realized that it shows up on some machines and not on others. In particular, the problem is evident on a host running Keras 2.3.1, while another host with Keras 2.2.4 doesn't show it. I'll post a minimal example here along with the specific module versions ASAP.

lurix66
  • You have to include your example (actual code that we can run) that according to you is broken; I have used Keras' BN layer many times and it works fine, so it's not clear to me what exact issue you are having. – Dr. Snoopy Oct 30 '19 at 15:01
  • See [this answer](https://stackoverflow.com/questions/58612783/batch-normalization-yes-or-no/58617741#58617741) – OverLordGoldDragon Oct 30 '19 at 15:03
  • Thanks @OverLordGoldDragon, I reviewed the link and found again [this one](https://github.com/keras-team/keras/issues/12400). So Keras BN is definitely bugged, and everybody using it accepts getting no sensible predictions out of it? I can't believe this – lurix66 Oct 30 '19 at 15:38
  • @Matias, here you are, the first one claiming it works fine. I'll post my own minimalistic code example asap (btw, the link in the previous comment includes code, but no solution) – lurix66 Oct 30 '19 at 15:47
  • The link @OverLordGoldDragon provided shows a problem in tf.keras, not for keras, they are different implementations. There could surely be bugs in tf.keras, so this is why examples are very important – Dr. Snoopy Oct 30 '19 at 15:50
  • Unsure if "accepts" is the best word, more like "official devs can't be bothered with debugging most things" - for understandable reasons, but not ones I accept (i.e. pay your damn devs, Google). The thread you stumbled upon offers solid evidence that not all's alright with BN. Matias' position's more accurately rephrased as, "it's mostly fine, as long as your data's fine". As noted in my answer, I'll investigate it eventually. – OverLordGoldDragon Oct 30 '19 at 15:53
  • @MatiasValdenegro False; `keras` BN does not fully work as intended. If it was fixed in 2.3.0+, I wouldn't know, but 2.2.4 and below, it's broken. – OverLordGoldDragon Oct 30 '19 at 15:54
  • @OverLordGoldDragon Can you point to the actual commit fix? If not, its just speculation. It could just be numerical instability that is very hard to find out. Its easy to rant but hard to provide hard proofs. – Dr. Snoopy Oct 30 '19 at 15:59
  • @MatiasValdenegro Numeric instability doesn't yield the dramatic differences I've shown in the answer over a mere 100 iterations, w/ fixed seeds. TF2 clearly works as intended, TF1.14.0 doesn't - which commit made the difference I don't know, but it wouldn't have "fixed itself." – OverLordGoldDragon Oct 30 '19 at 16:07
  • @OverLordGoldDragon my point is that you don't know what changed, so you cannot claim that something was fixed; a fix would have had to be intentional. About numerical issues, just see this talk: https://www.youtube.com/watch?v=x7psGHgatGM – Dr. Snoopy Oct 30 '19 at 20:42
  • @MatiasValdenegro A 20-min clip response to an SO comment doesn't seem quite fitting, but I'll still put it on my list. Regardless, you are making a far greater claim than I am - that is, TensorFlow has some severe numeric instability issues. This is simply false, though I won't be 'proving' that here - you're free to open a Git issue on this if you wish. – OverLordGoldDragon Oct 30 '19 at 20:50
  • @MatiasValdenegro Also, in case it wasn't clear, I wasn't referring to BN since your "Can you point to the..." comment - but to [this answer](https://stackoverflow.com/questions/58612783/batch-normalization-yes-or-no/58617741#58617741), which focuses on the Activation (identity) layer. If you were discussing BN this entire time, I admit I'm not nearly in a position to disagree (on numeric instability), and anything's up for grabs. – OverLordGoldDragon Oct 30 '19 at 20:52
  • @OverLordGoldDragon Not specifically for tensorflow, my point is that in Machine Learning there are many numerical issues, and the problem is that we are not aware of them and don't know the reasons. The linked talk describes this problem with some trivial evidence. From a scientific POV, we need to know if something breaks and why, and for this one needs information, which was not provided in this question. And finally, keras and tf.keras are not the same thing; I use plain keras, not tf.keras, and I haven't seen this problem, so it could be specific to tf.keras since they made many changes. – Dr. Snoopy Oct 30 '19 at 21:03
  • @MatiasValdenegro I understand the instability argument, but it's a question of _extent_; if linked answer's discrepancies were due to sheer instability, TF would be worthless. The BN matter in particular concerns itself with differences in train vs. inference modes, and I never looked into it in detail so I cannot tell what the deal is. OP's question, however, is well-posed - agreeably it'd benefit from a minimal reproducible example, but it does link existing threads which provide just that. – OverLordGoldDragon Oct 30 '19 at 21:12
  • @lurix66 Your question asks many questions, or is at least formulated as if it does - I'd suggest reorganizing it to convey a single unified question, and also include a minimally reproducible example or a direct link to (e.g. a Github comment). Else a mod may close your question. – OverLordGoldDragon Oct 30 '19 at 21:15
  • @OverLordGoldDragon The talk shows that sometimes there are bugs or unexplained issues (like numerics), and you don't know when these happen, and it's really bad not to know that you don't know. I just require information to get an explanation of the issue. We cannot reproduce the issue here, because there is no code. – Dr. Snoopy Oct 31 '19 at 11:53
