
I set up a model with Keras, then trained it on a dataset of 3 records, and finally tested the resulting model with evaluate() and predict(), using the same test set for both functions (the test set has 100 records and shares no records with the training set, as far as that matters given the size of the two datasets). The dataset is composed of 5 files: 4 files each represent a different temperature sensor that collects 60 measurements per minute (each row contains 60 measurements), while the last file contains the class labels that I want to predict (in particular, 3 classes: 3, 20 or 100).
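For reference, a minimal sketch of the shapes this layout implies (the arrays here are just placeholders, not real data):

import numpy as np

# 4 sensor files, one row per example, 60 readings per row
# -> network input of shape (n_examples, 60, 4)
X_test = np.zeros((100, 60, 4))

# profile.txt: one of 3 classes (3, 20 or 100) per example, one-hot encoded
# for categorical_crossentropy -> labels of shape (n_examples, 3)
y_test = np.zeros((100, 3))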

This is the model I'm using:

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dropout, Dense

n_sensors, t_periods = 4, 60

model = Sequential()
model.add(Conv1D(100, 6, activation='relu', input_shape=(t_periods, n_sensors)))
model.add(Conv1D(100, 6, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(160, 6, activation='relu'))
model.add(Conv1D(160, 6, activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I train it like this: self.model.fit(X_train, y_train, batch_size=3, epochs=5, verbose=1)

Then I use evaluate: self.model.evaluate(x_test, y_test, verbose=1)
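Since the model is compiled with metrics=['accuracy'], evaluate() returns both the loss and the accuracy; a minimal sketch of reading them:

loss, acc = self.model.evaluate(x_test, y_test, verbose=1)  # acc is the accuracy I compare against below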

And predict:

predictions = self.model.predict(data)
result = np.where(predictions[0] == np.amax(predictions[0]))
if result[0][0] == 0:
    return '3'
elif result[0][0] == 1:
    return '20'
else:
    return '100'
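This if chain just picks the index of the largest probability, so an equivalent one-liner (a sketch, assuming the class order 3/20/100 matches the one-hot columns) would be:

class_names = ['3', '20', '100']  # assumed to match the one-hot column order
return class_names[int(np.argmax(predictions[0]))]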

For each predicted class, I compare it with the actual label and then calculate correct guesses / total examples, which should be equivalent to the accuracy from the evaluate() function. Here's the code:

correct = 0
for profile in self.profile_file:  # profile_file is an opened file with the actual labels
    ts1 = self.ts1_file.readline()
    ts2 = self.ts2_file.readline()
    ts3 = self.ts3_file.readline()
    ts4 = self.ts4_file.readline()
    data = ts1, ts2, ts3, ts4
    test_data = self.dl.transform(data)  # see the last block of code I posted
    prediction = self.model.predict(test_data)  # mapped to '3', '20' or '100' as shown above
    if prediction == label:  # label is the actual class read from profile
        correct += 1
acc = correct / 100  # 100 is the total number of examples
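For comparison, the same accuracy could be computed in one shot, mirroring how Keras' categorical_accuracy works (a sketch; x_test and y_test are the arrays passed to evaluate()):

import numpy as np

predictions = self.model.predict(x_test)
acc = np.mean(np.argmax(predictions, axis=1) == np.argmax(y_test, axis=1))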

The data fed to evaluate() is produced by this function:

label = pd.read_csv(os.path.join(self.testDir, 'profile.txt'), sep='\t', header=None)
label = np_utils.to_categorical(label[0].factorize()[0])  # one-hot encode the 3 classes
data = [os.path.join(self.testDir, 'TS2.txt'), os.path.join(self.testDir, 'TS1.txt'),
        os.path.join(self.testDir, 'TS3.txt'), os.path.join(self.testDir, 'TS4.txt')]
df = pd.DataFrame()
for txt in data:
    read_df = pd.read_csv(txt, sep='\t', header=None)
    df = df.append(read_df)
df = df.apply(self.__predict_scale)
df = df.sort_index().values.reshape(-1, 4, 60).transpose(0, 2, 1)  # (examples, 60, 4)
return df, label

While the data fed to predict() comes from this other one:

df = pd.DataFrame()
for txt in data:  # data is the tuple of raw lines (ts1, ts2, ts3, ts4) built above
    read_df = pd.read_csv(StringIO(txt), sep='\t', header=None)
    df = df.append(read_df)
df = df.apply(self.__predict_scale)
df = df.sort_index().values.reshape(-1, 4, 60).transpose(0, 2, 1)  # (examples, 60, 4)
return df
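For clarity, a minimal sketch (with placeholder values) of what the final reshape/transpose is expected to do: group the 4 sensor rows belonging to the same example and end up with shape (n_examples, 60, 4), matching input_shape=(t_periods, n_sensors):

import numpy as np

stacked = np.arange(8 * 60).reshape(8, 60)           # 2 examples x 4 sensors, rows already grouped by example
arr = stacked.reshape(-1, 4, 60).transpose(0, 2, 1)  # (2, 60, 4): example, time step, sensor
print(arr.shape)                                     # (2, 60, 4)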

The accuracies yielded by evaluate() and predict() are always different: the largest gap I noticed was evaluate() reporting 78% accuracy while predict() gave 95%. The only difference between the two is that I run predict() on one example at a time, while evaluate() takes the entire dataset at once, but that should make no difference. How can this be?

UPDATE 1: It seems that the problem is in how I prepare my data. In the case of predict(), I transform only one line at a time from each file using the last block of code I posted, while for evaluate() I transform the entire files using the other function reported above. Why should that make a difference? It seems to me that I'm applying exactly the same transformation; the only difference is the number of rows transformed.
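One way to check this (following the suggestion in the comments below; load_test_data is a hypothetical stand-in name for the whole-file function above) is to transform the whole files once, transform each line separately, and compare the resulting arrays:

import numpy as np

x_all, y_all = self.load_test_data()  # hypothetical wrapper around the whole-file function used for evaluate()

rows = []
for ts1, ts2, ts3, ts4 in zip(self.ts1_file, self.ts2_file, self.ts3_file, self.ts4_file):
    rows.append(self.dl.transform((ts1, ts2, ts3, ts4)))  # the line-by-line transform used before predict()
x_rows = np.concatenate(rows, axis=0)

print(x_all.shape, x_rows.shape)   # both should be (100, 60, 4)
print(np.allclose(x_all, x_rows))  # False here would point at the data preparation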

  • How do you compute the accuracy from the predictions? Also, could you please explain what `result` is? – rvinas Sep 12 '19 at 07:31
  • I collect the predictions and increase a counter each time they are equal to the real class, then I divide the counter by the number of predictions made. `result` is derived from `predictions`: predict() returns an array of probabilities of the example belonging to each class (I have 3 possible classes, so it returns 3 probabilities); with `result` I get the index of the highest probability, which I map to the corresponding class through the if chain above. – DDD Sep 12 '19 at 12:07
  • @DDD the issue most likely comes from the way you compute accuracy after prediction, could you share that part of the code? – filippo Sep 12 '19 at 15:16
  • @filippo added right now. Let me know if you need any other information. Thank you for your help. – DDD Sep 12 '19 at 15:38
  • Sounds like a bug in your code somewhere more than an issue with Keras. Add some debugging and check whether the dataframes you're providing to `evaluate` and `predict` are really the same (minus the line-by-line thing). I cannot see the bug in the code you posted, but there are still some pieces missing here and there... Check if the data is the same, then check the way you're calculating accuracy and see how Keras is doing it https://github.com/keras-team/keras/blob/master/keras/metrics.py (it should be `categorical_accuracy`) – filippo Sep 13 '19 at 07:42
  • @filippo if you need some other block of code feel free to ask, I'm pretty much stuck as I tried all the possible solutions I had in mind, so a helping hand would be highly appreciated... – DDD Sep 13 '19 at 10:21
  • why do you sort by index? – akhetos Sep 13 '19 at 14:23
  • When you use categorical loss, Keras probably uses "categorical_accuracy". Are you using categorical accuracy in your test? – Daniel Möller Sep 16 '19 at 14:38

1 Answer


This question was already answered here

What happens is that when you evaluate the model, since your loss function is categorical_crossentropy, metrics=['accuracy'] calculates categorical_accuracy.

But predict() has a default set to binary_accuracy.

So essentially you are calculating categorical accuracy with evaluate() and binary accuracy with predict(). This is the reason they are so widely different.

The difference between categorical_accuracy and binary_accuracy is that categorical_accuracy checks whether all the outputs match your y_test, while binary_accuracy checks whether each of your outputs individually matches your y_test.

Example (single row):

prediction = [0,0,1,1,0]
y_test = [0,0,0,1,0]

categorical_accuracy = 0%

Since one output does not match, the categorical_accuracy is 0.

binary_accuracy = 80%

Even though one output doesn't match, the remaining 80% do match, so the accuracy is 80%.
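Roughly how the two metrics arrive at those numbers, as a minimal numpy sketch of this example (mirroring how Keras implements them):

import numpy as np

prediction = np.array([0, 0, 1, 1, 0])
y_test     = np.array([0, 0, 0, 1, 0])

# categorical_accuracy: does the predicted argmax hit the label's argmax?
categorical_acc = float(np.argmax(prediction) == np.argmax(y_test))  # 0.0
# binary_accuracy: what fraction of the individual outputs match after rounding?
binary_acc = float(np.mean(prediction.round() == y_test))            # 0.8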
