
Background

I use an Anaconda environment on Windows 10, created following this post by Mike Müller:

conda create -n keras python=3.6
conda activate keras
conda install keras

This environment has Python 3.6.8, Keras 2.2.4, TensorFlow 1.12.0, and NumPy 1.16.1.

I was working on optimizing code for a team I had just joined when I found I couldn't even run their code. I reduced it to a test case as an MCVE (minimal and complete for me, at least; apologies for not being able to give an example you can actually run, since the model and data are in-house):

import unittest

import keras


class TestEvaluation(unittest.TestCase):
    def setUp(self):
        # In-house function (defined elsewhere); loads inputs and labels properly.
        self.inputs, self.labels = load_data()
        # Using a pretrained model, known to work.
        self.model = keras.models.load_model('model_name.h5')
        # Passes; the model loads successfully.
        self.assertIsNotNone(self.model)

    def test_model_evaluation(self):
        # Fails on my machine, reporting high loss and 0% accuracy.
        # evaluate() returns [loss, accuracy] for a model compiled with
        # metrics=['accuracy'], so scores[1] is the accuracy fraction.
        scores = self.model.evaluate(self.inputs, self.labels)
        accuracy = scores[1] * 100
        self.assertAlmostEqual(accuracy, 93, delta=5)


if __name__ == '__main__':
    unittest.main()

Research

This exact scenario runs perfectly fine on someone else's computer, so we deduced the following: we have the same code, model, and data. Therefore, it should be the environment, right?

I built more Anaconda environments to reproduce the version numbers that work on their machine. However, this didn't fix it. Moreover, this seems to be an issue that not many other people have hit, as far as I've found by searching online.

I went through the following other environments:

  • Python 3.6.4, Keras 2.2.4, TensorFlow 1.12.0, NumPy 1.16.2
    • (The one that worked for someone else, though admittedly without Anaconda)
  • Python 3.5.2, Keras 2.2.2, TensorFlow 1.10.0, NumPy 1.15.2
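
To double-check what each rebuilt environment actually had installed, a quick version dump like this works (a minimal sketch; run it inside the activated environment):

import sys

import keras
import numpy
import tensorflow

# Print the versions that matter for reproducing the other machine.
print('Python    ', sys.version.split()[0])
print('Keras     ', keras.__version__)
print('TensorFlow', tensorflow.__version__)
print('NumPy     ', numpy.__version__)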

Question

The model is pretrained and the validation set loads correctly, yet Keras fails to report the ~93% accuracy I'm expecting.

How can I fix this issue of getting 0% accuracy?


Update

I've learned a lot more about the situation. I found that installing a Python 3.6 environment on Ubuntu 18.04 got me to random guessing (~25% accuracy, consistent with chance on a four-class problem). So, it's no longer 0%! Further, I tried to replicate a machine that's been used heavily for testing, which ran Ubuntu 16.04.5. That got me to ~46% accuracy. I wasn't able to replicate it perfectly, since Ubuntu forced an update to 16.04.6 when I installed some packages, and I also don't know exactly how the team runs things on that test machine (I tried it myself, and it didn't work).

I also learned that the guy who compiled and saved the model was using macOS High Sierra, but he also gets it to work in the lab environment. I'll need to follow up on that.

Further, I kept searching online and found others with the same issue:

  • Keras issue #7676 - An open issue for nearly 2 years. The OP reported his saved model works differently on different machines, which sounds a lot like my problem.

  • Keras issue #4875 - An open issue for over 2 years. This particular comment seems to be the common solution. I'm not sure if this will solve the problem or not, and I don't actually have the code that compiled this model. However, it seems that many people found issues in how their model was built and saved, so I might need to investigate this further...

I apologize for claiming a solution before; I was ecstatic to see that assertNotEqual(accuracy, 0) passed.

  • As a first step of debugging, I suggest you check whether the weights of one specific layer of the model (after loading it) are the same as what you see on the other systems, and report back so I can help you overcome your issue. – pouyan Mar 15 '19 at 21:24
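
In code, pouyan's suggested check might look something like this (a minimal sketch; the layer index and dump filename are hypothetical):

import numpy as np
import keras

# On the reference machine, dump one layer's weights to a file
# that can be copied to the machine under test.
model = keras.models.load_model('model_name.h5')
weights = model.layers[0].get_weights()  # layer index 0 is just an example
np.savez('layer0_weights.npz', *weights)

# On the machine under test, load the same model plus the reference
# dump, and compare element-wise (np.savez stores arrays as arr_0, arr_1, ...).
reference = np.load('layer0_weights.npz')
for i, w in enumerate(weights):
    print('tensor %d identical: %s' % (i, np.array_equal(w, reference['arr_%d' % i])))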

1 Answer


Be Aware

I previously wrote an incorrect answer, and this may very well be another poorly formed solution. Please be aware that I haven't fully tested this hypothesis. Also be aware that this is still an open issue in the Keras community, and that many people have arrived at this problem by messing things up in a number of different ways.


Developing Our Solution

Let Person A be the guy who can run the model okay on our lab computers, as well as his MacBook. Let Person B be the one who can't (i.e. me and everyone else).

I got my team to take this problem more seriously. We got to the point where A had a terminal open at a desktop right next to B. A ran the test script and got 92% accuracy; B ran the same script and got 2%. At that point we were on the same machine, using the exact same Python environment and Keras settings (~/.keras). We were also sure we had the same script, model, and data. Or so we thought.
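
For reference, the per-user Keras settings live in ~/.keras/keras.json (backend, floatx, epsilon, image_data_format), and differences there are worth ruling out when comparing two accounts. A minimal sketch for dumping them on each account:

import json
import os

# Print the per-user Keras configuration so the two accounts can be
# compared side by side.
config_path = os.path.expanduser('~/.keras/keras.json')
with open(config_path) as f:
    print(json.dumps(json.load(f), indent=2, sort_keys=True))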

I chose to doubt everything at that point. I scp'd the script, model, and data from A's account to B's account. It worked. Here's what that could mean as a solution:


A Guess at the Problem

The files B had were bad. B got them from team storage on Google Drive, as well as from Slack; further, some were delivered by A from his MacBook. The scripts were genuinely identical. The model and data files B had, however, differed at the binary level: they were the same size in bytes and looked "similar" in a hex dump, so it could have been an encoding issue. A checksum comparison (sketched after the list below) makes that kind of difference easy to detect.

  1. It wasn't Google Drive. I uploaded and re-downloaded the correct files, and nothing went wrong. However, the wrong file was there to begin with.

  2. Possibly Slack? Perhaps Slack was corrupting the encoding when B downloaded A's files.

  3. Possibly the files coming from a MacBook? macOS generates a lot of .DS_Store-like metadata files, and I don't know much about it; macOS might've played a role in the model and data being OS-dependent. I wouldn't rule it out simply because I'm ignorant of how that OS operates. I heavily suspect this, though, because I happen to have a spare MacBook, and I got everything to work in that environment before we started testing on the same machine.
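
Whatever the delivery path, comparing checksums of each copy catches this kind of silent corruption immediately. A minimal sketch (the data filename is hypothetical; run it on each account and compare the output):

import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Hash the model and data; identical files must produce identical digests.
for name in ['model_name.h5', 'data.npz']:
    print(name, sha256sum(name))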


Worst Case Scenario

We're accepting that we can get the model to work on a single machine that everyone has access to. Does this mean that the model might still not work on other machines? Unfortunately, yes.

We're not taking the time to test other machines after wasting nearly 2 months on this problem. I hope this research and debugging helps someone else out. I didn't want to leave it at "never mind, fixed it."

  • Big endian issue? Are you reading from binary files? Can you check the weights loaded (first 10 values of the first layer, for example)? – Simon Caby Mar 30 '19 at 09:18
  • I checked the weights a while ago as @pouyan had suggested. Now that I have a bit more control over which files work and don’t, I can revisit that. Before, we found the largest layers were exactly the same, but I can’t say for certain we weren’t comparing the same bad file. –  Mar 30 '19 at 16:20
  • Big endian on Mac platform? Happened to me once. – Simon Caby Mar 31 '19 at 05:47