
I want to get reproducible results for a CNN. I use Keras and Google Colab with GPU.

In addition to following recommendations to insert certain code snippets, which should ensure reproducibility, I also added seeds to the layers.

##### This is the first code snippet to run #####

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
##### This is the second code snippet to run #####

from __future__ import print_function  
import numpy as np 

import tensorflow as tf
print(tf.test.gpu_device_name())

import random as rn 
import os 
os.environ['PYTHONHASHSEED'] = '0' 
np.random.seed(1)   
rn.seed(1)   
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1) 

##### This is the third code snippet to run #####

from keras import backend as K

tf.set_random_seed(1) 
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)  
K.set_session(sess)   
##### This is the fourth code snippet to run #####

def model_cnn():
  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3,3), kernel_initializer=initializers.glorot_uniform(seed=1), input_shape=(28,28,1)))
  model.add(BatchNormalization())
  model.add(Activation('relu'))

  model.add(Conv2D(32, kernel_size=(3,3), kernel_initializer=initializers.glorot_uniform(seed=2)))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(MaxPooling2D(pool_size=(2,2)))
  model.add(Dropout(0.25, seed=1))  

  model.add(Flatten())

  model.add(Dense(512, kernel_initializer=initializers.glorot_uniform(seed=2)))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dropout(0.5, seed=1))
  model.add(Dense(10, kernel_initializer=initializers.glorot_uniform(seed=2)))
  model.add(Activation('softmax'))

  model.compile(loss="categorical_crossentropy", optimizer=Adam(lr=0.001), metrics=['accuracy'])
  return model


def split_data(X,y):
  X_train_val, X_val, y_train_val, y_val = train_test_split(X, y, random_state=42, test_size=1/5, stratify=y) 
  return(X_train_val, X_val, y_train_val, y_val) 


def train_model_with_EarlyStopping(model, X, y):
  # make train and validation data
  X_tr, X_val, y_tr, y_val = split_data(X,y)

  es = EarlyStopping(monitor='val_loss', patience=20, mode='min', restore_best_weights=True)

  history = model.fit(X_tr, y_tr,
                      batch_size=64,
                      epochs=200, 
                      verbose=1,
                      validation_data=(X_val,y_val),
                      callbacks=[es])    

  return history
##### This is the fifth code snippet to run #####

train_model_with_EarlyStopping(model_cnn(), X, y)

Every time I run the above code I get different results. Does the reason lie in the code, or is it simply not possible to obtain reproducible results in Google Colab with GPU support?


The complete code (it contains unnecessary parts, such as imports of libraries that are not used):

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
from __future__ import print_function # NEW
import numpy as np 

import tensorflow as tf
import random as rn 
import os 
os.environ['PYTHONHASHSEED'] = '0' 
np.random.seed(1)   
rn.seed(1)   
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1) 
from keras import backend as K

tf.set_random_seed(1)  
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)   
K.set_session(sess)  

import os
local_root_path = os.path.expanduser("~/data/data")
print(local_root_path)
os.makedirs(local_root_path, exist_ok=True)  # exist_ok: no error if it already exists

def ListFolder(google_drive_id, destination):
  file_list = drive.ListFile({'q': "'%s' in parents and trashed=false" % google_drive_id}).GetList()
  counter = 0
  for f in file_list:
    # If it is a directory, create the directory locally and download the files inside it
    if f['mimeType']=='application/vnd.google-apps.folder': 
      folder_path = os.path.join(destination, f['title'])
      os.makedirs(folder_path, exist_ok=True)
      print('creating directory {}'.format(folder_path))
      ListFolder(f['id'], folder_path)
    else:
      fname = os.path.join(destination, f['title'])
      f_ = drive.CreateFile({'id': f['id']})
      f_.GetContentFile(fname)
      counter += 1
  print('{} files were downloaded to {}'.format(counter, destination))
ListFolder("1DyM_D2ZJ5UHIXmXq4uHzKqXSkLTH-lSo", local_root_path)

import glob
import h5py
from time import time
from keras import initializers 
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential, model_from_json
from keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization, merge
from keras.layers import Convolution2D, MaxPooling2D, AveragePooling2D
from keras.optimizers import SGD, Adam, RMSprop, Adagrad, Adadelta, Adamax, Nadam
from keras.utils import np_utils
from keras.callbacks import LearningRateScheduler, ModelCheckpoint, TensorBoard, ReduceLROnPlateau
from keras.regularizers import l2
from keras.layers.advanced_activations import LeakyReLU, ELU
from keras import backend as K
import numpy as np
import pickle as pkl
from matplotlib import pyplot as plt
%matplotlib inline
import gzip
import numpy as np
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten
from keras.datasets import fashion_mnist
from numpy import mean, std
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, StratifiedKFold
from keras.datasets import fashion_mnist
from keras.utils import to_categorical
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from keras.optimizers import SGD, Adam
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import auc, average_precision_score, f1_score

import time
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from google.colab import files
from PIL import Image 



def model_cnn():
  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3,3), kernel_initializer=initializers.glorot_uniform(seed=1), input_shape=(28,28,1)))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Conv2D(32, kernel_size=(3,3), kernel_initializer=initializers.glorot_uniform(seed=2)))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(MaxPooling2D(pool_size=(2,2)))
  model.add(Dropout(0.25, seed=1))  
  model.add(Flatten())
  model.add(Dense(512, kernel_initializer=initializers.glorot_uniform(seed=2)))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dropout(0.5, seed=1))
  model.add(Dense(10, kernel_initializer=initializers.glorot_uniform(seed=2)))
  model.add(Activation('softmax'))
  model.compile(loss="categorical_crossentropy", optimizer=Adam(lr=0.001), metrics=['accuracy'])
  return model

def train_model_with_EarlyStopping(model, X, y):
  X_tr, X_val, y_tr, y_val = split_data(X,y)
  es = EarlyStopping(monitor='val_loss', patience=20, mode='min', restore_best_weights=True)      
  history = model.fit(X_tr, y_tr,
                      batch_size=64,
                      epochs=200, 
                      verbose=1,
                      validation_data=(X_val,y_val),
                      callbacks=[es])    
  evaluate_model(model, history, X_tr, y_tr)
  return history 





Code Now
    Are you getting significantly different results or non-exactly-equal results? I have seen that some GPU operations are not always perfectly reproducible (e.g. slight differences in sum-reduction of big tensors), which is supposed to be a tradeoff for better performance. There are some filed issues about these kinds of things, like [this](https://github.com/tensorflow/tensorflow/issues/19200). I'm not sure to what extent that still happens though. – jdehesa Oct 17 '19 at 13:23
  • I have test accuracies of 0.8905, 0.8796, 0.8849... The problem is also that I get the highest accuracies in different epochs, so EarlyStopping stops in epoch 15, 29, 30... – Code Now Oct 17 '19 at 13:59
  • It'd help to see your _full_ code, including input data with correct shapes (can be randomly generated) – OverLordGoldDragon Oct 17 '19 at 14:24
  • I have attached the full code under the previous code snippets – Code Now Oct 17 '19 at 15:02
  • Remember to mention users by name w/ '@' to notify them of your responses - checked back on this question myself. – OverLordGoldDragon Oct 18 '19 at 05:29
  • check my answer on Stackoverflow: [I tried most of the solutions on the web and just the following codes worked for me...](https://stackoverflow.com/a/73392225/4183916) – Ali karimi Aug 17 '22 at 16:59

1 Answer

The problem isn't limited to Colab, and is reproducible locally. The behavior, however, may be inevitable.

The code at the bottom is a minimally-reproducible version of your code, with fit parameters tweaked for faster testing. What I observed: the maximum difference in loss across 5 runs was only 0.0144%, at 468 iterations per run. This is pretty good. With batch_size=64, 60000 samples, and 20 epochs, you'll have 18750 iterations - which will amplify this figure substantially.

Regardless, GPU parallelism is the most likely culprit driving the randomness - and the small differences do accumulate over time to yield a substantial difference - demo below. If 1e-8 seems small, try adding random noise to half your weights w/ magnitude clipped at 1e-8, and witness its life philosophy change.
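A minimal sketch of that noise experiment, assuming a compiled Keras model named model (e.g. model = model_cnn() from your code); which tensors get perturbed is illustrative:

import numpy as np

weights = model.get_weights()
for i in range(0, len(weights), 2):  # perturb roughly half the weight tensors
    noise = np.clip(np.random.randn(*weights[i].shape), -1, 1) * 1e-8
    weights[i] = weights[i] + noise
model.set_weights(weights)
# model.predict now differs slightly from before - and training amplifies the gap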

The role of the seeds becomes dramatically pronounced if you don't use them - try it, all your metrics will fly rampant within the first 10 iterations. Also, loss is better for measuring runtime differences, as accuracy is a lot more sensitive to numeric precision errors: the difference between 60% accuracy and 70% accuracy on a 10-sample batch can come down to a single prediction that differs by 0.000001 from the 0.5 decision threshold - but loss will barely budge.
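A toy illustration, with a prediction sitting at the 0.5 decision boundary:

import numpy as np

y_true = 1
for p in (0.4999995, 0.5000005):          # predictions differing by 1e-6
    acc = int((p > 0.5) == bool(y_true))  # thresholded accuracy flips 0 -> 1
    loss = -np.log(p)                     # cross-entropy shifts by only ~2e-6
    print('p={:.7f}  acc={}  loss={:.7f}'.format(p, acc, loss))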

Lastly, note that your hyperparameter choice will have a far greater impact upon model performance than randomness; no matter how many seeds you throw, they won't magic a model into SOTA. -- I recommend this fine clip.


Your code is fine. You've taken all practical steps to ensure reproducibility, with one exception: PYTHONHASHSEED must be set before your Python kernel starts.
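Setting it inside a running notebook, as your second snippet does, comes too late. For a standalone script, one workaround is to re-exec the interpreter before anything else runs - a sketch, not applicable in Colab, where the kernel is launched for you:

import os, sys

if os.environ.get('PYTHONHASHSEED') != '0':
    os.environ['PYTHONHASHSEED'] = '0'
    # restart the interpreter so hash randomization is seeded from the very start
    os.execv(sys.executable, [sys.executable] + sys.argv)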


What can you do to reduce randomness?

  1. Repeat runs, average results. Understandably that's expensive, but note that even a perfectly reproducible run isn't perfectly informative, as model variance w.r.t. train & validation sets is likely to be much greater than noise-induced randomness

  2. K-Fold Cross-Validation: can mitigate both data & noise variance significantly (a sketch follows this list)

  3. Larger validation set: extracted features can differ only so much due to noise; the larger the validation set, the less small perturbations in weights should reflect in metrics
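
A sketch of point 2, reusing model_cnn and the X, y from your code; it assumes y is one-hot encoded (hence the argmax), and the fold count and epochs are arbitrary:

import numpy as np
from sklearn.model_selection import StratifiedKFold

scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for tr_idx, val_idx in skf.split(X, y.argmax(axis=1)):
    model = model_cnn()  # fresh weights each fold
    model.fit(X[tr_idx], y[tr_idx], batch_size=64, epochs=20, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)
print('val_acc: mean={:.4f}, std={:.4f}'.format(np.mean(scores), np.std(scores)))

The std across folds also gives you the run-to-run variance estimate from point 1 for free.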


GPU Parallelism: amplifying float error

print(2. * 11. / 9.)  # 2.4444444444444446
print(2. / 9. * 11.)  # 2.444444444444444

Order of operations matters, and by exploiting multithreading, GPU parallelism gives no guarantee whatsoever that operations are executed in the same order. At first glance, the difference may look innocent - but give it enough iterations ...

one = 1
for _ in range(int(1e8)):
    one *= (2. / 9. * 11.) / (2. * 11. / 9.)
print(one)     # 0.9999999777955395
print(1 - one) # 1.8167285897874308e-08

... and a "one" is now 1e-08 - a typical small-weight magnitude - away from its original self. If 100 million iterations seems like a stretch, consider that this loop completed in about half a minute, whereas your model can train for over an hour - and the former runs entirely on CPU.


Minimal reproducible experimentation:

import tensorflow as tf
import random as rn 
import numpy as np
np.random.seed(1)   
rn.seed(2)   
tf.set_random_seed(3)

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization
from keras.layers import MaxPooling2D, Conv2D
from keras.optimizers import Adam

def model_cnn():
  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3,3), 
                   kernel_initializer='he_uniform', input_shape=(28,28,1)))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Conv2D(32, kernel_size=(3,3), kernel_initializer='he_uniform'))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(MaxPooling2D(pool_size=(2,2)))
  model.add(Dropout(0.25))
  model.add(Flatten())
  model.add(Dense(512, kernel_initializer='he_uniform'))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dropout(0.5))
  model.add(Dense(10, kernel_initializer='he_uniform'))
  model.add(Activation('softmax'))
  model.compile(loss="categorical_crossentropy", optimizer=Adam(lr=0.001), 
                metrics=['accuracy'])
  return model

np.random.seed(1)   
rn.seed(2)     
tf.set_random_seed(3) 

X_train = np.random.randn(30000, 28, 28, 1)
y_train = np.random.randint(0, 2, (30000, 10))
X_val   = np.random.randn(30000, 28, 28, 1)
y_val   = np.random.randint(0, 2, (30000, 10))
model = model_cnn()

np.random.seed(1)   
rn.seed(2)   
tf.set_random_seed(3)

history = model.fit(X_train, y_train, batch_size=64, shuffle=True, 
                    epochs=1, verbose=1, validation_data=(X_val,y_val))

Run differences:

loss: 12.5044 - acc: 0.0971 - val_loss: 11.5389 - val_acc: 0.1051
loss: 12.5047 - acc: 0.0958 - val_loss: 11.5369 - val_acc: 0.1018
loss: 12.5055 - acc: 0.0955 - val_loss: 11.5382 - val_acc: 0.0980
loss: 12.5042 - acc: 0.0961 - val_loss: 11.5382 - val_acc: 0.1179
loss: 12.5062 - acc: 0.0960 - val_loss: 11.5366 - val_acc: 0.1082
OverLordGoldDragon
  • If I understand correctly, then it would be advisable not to use seeds and run the model multiple times? Actually, I wanted to compare this model architecture with other 5 model architectures. How many runs should I do per architecture? Assuming I don't use EarlyStopping and I use the Fashion-MNIST dataset, which has a test dataset of 10000 images, then I could simply take the test dataset for testing after each run instead of doing K-fold cross validation, because the test dataset should provide enough variance? (Sorry for the direct question but I'm a beginner in deep learning) – Code Now Oct 18 '19 at 08:25
  • @CodeNow No problem - [see here](https://stackoverflow.com/questions/58103035/keras-early-stopping-with-train-on-batch/58103272#58103272). I've trained considerably more complex models on considerably more complex data, what I can tell you is - random seeds aren't perfect w/ GPU, but they're definitely good enough. For your application, I'd (1) select X (see link); (2) run each hyperparam configuration for X epochs; (3) select model w/ best hyperparams. Give up test set, merge with val set - then, to estimate test performance, run K-fold CV on your _final model_. – OverLordGoldDragon Oct 18 '19 at 13:43
  • @CodeNow Also, try to write your own early-stop code (e.g. callbacks), as _loss_ isn't always the best metric (especially for imbalanced data classification). Something like `if f1_score > best_f1_score: model.save()`, and include best metric in model name (e.g. `'model' + str(best_f1_score) + '.h5'`) to allow easy comparison. Ideally, include key hyperparams & their values in model name to allow easy comparison – OverLordGoldDragon Oct 18 '19 at 13:46
  • @OverLordGoldDragon What would be wrong if I just optimize the hyperparameters like learning rate, number of epochs etc. (so without early stopping) via GridSearchCV for each model architecture? I always make sure that the same KFold splits are done for each model (by using the same np.random.seed). After that I run each model several times (without setting any seeds) with the optimal hyperparameters found, using the full training dataset, and test with the full testing dataset. And then for comparison I average these results for each model and calculate the standard deviation? – Code Now Oct 18 '19 at 18:08
  • @CodeNow That's taking the scope of the question substantially outside original, but I'll answer briefly: nothing 'wrong', but I'd advise going with the scheme I described instead. Reason being, early stopping is "free lunch" in a sense that extracted features are best after `X`, epochs, and `X` will vary for your K-Fold split - but you _aren't_ sacrificing "control" value by varying `X`, as such control is redundant. Your final model's likely to be much better if you compare early-stopped models as opposed to fixed-`X` models (but you _do_ need the same seed across each fold) – OverLordGoldDragon Oct 18 '19 at 18:14
  • @CodeNow Early-stopping also allows you to use higher learning rates, which, while being less stable, may yield superior optima – OverLordGoldDragon Oct 18 '19 at 18:15