tensorflow: jupyter kernel dies when running Convolutional Network

Question

I am trying to run a demo convolutional neural network from the code samples in the book Practical Convolutional Neural Networks, by Sewak et al. It is a simple dog/cat classifier built with TensorFlow. The problem is that I am running this TensorFlow code in a Jupyter notebook, and the kernel keeps dying when I execute the code that starts training the network. I am not sure whether this is an issue with the notebook, whether something is missing in the demo code, or whether this is a known issue and I should not train in a Jupyter notebook at all.

So let me provide a little detail on the environment. I have a Docker container that has TensorFlow GPU, Keras, and the other CUDA libraries installed. I have 3 GPUs on my computer. Inside the container there is an installation of Miniconda, so I am able to load and run notebooks, etc.

Here are a couple of things that I thought could be causing the notebook's Python 3.6 kernel to die.

I did not specifically identify the GPU to use in the Tensorflow code.
There could be an issue where the memory in the container is not allowed to grow (https://github.com/tensorflow/tensorflow/issues/9829)
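
Both ideas can be tried with a few lines placed before the session is created. The following is only a minimal sketch, assuming the TensorFlow 1.x session API that the book's code uses. The GPU index "0" and the helper name make_session are illustrative and do not come from the book's code.

```python
import os

# Assumption: GPU index "0" is only an illustration; choose whichever of the
# three GPUs should be visible. Setting this before TensorFlow is imported is
# the safe choice, because it has to be in place before the GPUs are initialised.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


def make_session():
    """Create a TF 1.x-style session that allocates GPU memory on demand."""
    import tensorflow as tf  # imported late so the variable above takes effect

    # TF >= 1.13 exposes the 1.x API under tf.compat.v1; older 1.x releases
    # expose ConfigProto and Session directly on the tf module.
    v1 = getattr(tf.compat, "v1", tf)

    config = v1.ConfigProto()
    # Grow the GPU allocation as needed instead of reserving all memory up front.
    config.gpu_options.allow_growth = True
    return v1.Session(config=config)
```

Calling session = make_session() wherever the script creates its session would apply both settings. Whether either of them is the actual cause of the dying kernel is not established here.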

I am not familiar enough with TensorFlow yet to really know the source of the problem. Since the code is running inside a container, the usual debugging tools are a bit more limited.
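
One low-tech way to get more information when a kernel dies silently is to run the training script as a child process and look at its exit code and the last lines of stderr. This is a sketch under that assumption, not a fix. The helper name run_and_report is hypothetical, and the script name in the usage comment is taken from the repository linked below.

```python
import subprocess
import sys


def run_and_report(cmd, tail_lines=20):
    """Run cmd as a child process and return (returncode, tail of stderr).

    On POSIX systems a negative return code -N means the child was
    terminated by signal N (for example -9 for SIGKILL, -11 for SIGSEGV).
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text; works on Python 3.6
    )
    tail = "\n".join(result.stderr.splitlines()[-tail_lines:])
    return result.returncode, tail


# Example usage (run from the directory containing the script):
# code, tail = run_and_report([sys.executable, "CNN_DogvsCat_Classifier.py"])
# print("exit code:", code)
# print(tail)
```

A negative exit code would point to the process being killed by a signal, such as the out-of-memory killer or a native crash, while a positive one with a Python traceback in the tail would point to an ordinary exception.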

The full code for training is located in the github repository: https://github.com/PacktPublishing/Practical-Convolutional-Neural-Networks/blob/master/Chapter03/Dog_cat_classification/CNN_DogvsCat_Classifier.py

Here is the optimize function that is used for training. Not sure if anyone can spot something missing.

def optimize(num_iterations):
    # Ensure we update the global variable rather than a local copy.
    global total_iterations

    # Start-time used for printing time-usage below.
    start_time = time.time()

    best_val_loss = float("inf")
    patience = 0

    for i in range(total_iterations, total_iterations + num_iterations):

        # Get a batch of training examples.
        # x_batch now holds a batch of images and
        # y_true_batch are the true labels for those images.
        x_batch, y_true_batch, _, cls_batch = data.train.next_batch(train_batch_size)
        x_valid_batch, y_valid_batch, _, valid_cls_batch = data.valid.next_batch(train_batch_size)

        # Convert shape from [num examples, rows, columns, depth]
        # to [num examples, flattened image shape]

        x_batch = x_batch.reshape(train_batch_size, img_size_flat)
        x_valid_batch = x_valid_batch.reshape(train_batch_size, img_size_flat)

        # Put the batch into a dict with the proper names
        # for placeholder variables in the TensorFlow graph.
        feed_dict_train = {x: x_batch, y_true: y_true_batch}        
        feed_dict_validate = {x: x_valid_batch, y_true: y_valid_batch}

        # Run the optimizer using this batch of training data.
        # TensorFlow assigns the variables in feed_dict_train
        # to the placeholder variables and then runs the optimizer.
        session.run(optimizer, feed_dict=feed_dict_train)        

        # Print status at end of each epoch (defined as full pass through training Preprocessor).
        if i % int(data.train.num_examples/batch_size) == 0: 
            val_loss = session.run(cost, feed_dict=feed_dict_validate)
            epoch = int(i / int(data.train.num_examples/batch_size))

            acc, val_acc = print_progress(epoch, feed_dict_train, feed_dict_validate, val_loss)
            msg = "Epoch {0} --- Training Accuracy: {1:>6.1%}, Validation Accuracy: {2:>6.1%}, Validation Loss: {3:.3f}"
            print(msg.format(epoch + 1, acc, val_acc, val_loss))
            print(acc)
            acc_list.append(acc)
            val_acc_list.append(val_acc)
            iter_list.append(epoch+1)

            if early_stopping:    
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    patience = 0
                else:
                    patience += 1
                if patience == early_stopping:
                    break

    # Update the total number of iterations performed.
    total_iterations += num_iterations

    # Ending time.
    end_time = time.time()

    # Difference between start and end-times.
    time_dif = end_time - start_time

    # Print the time-usage.
    print("Time elapsed: " + str(timedelta(seconds=int(round(time_dif)))))

@CarlosVegas I did not solve this exactly. Instead, I have just been writing the script in a Python file and running it from the terminal. Not sure why the kernel was dying. I think you can do the training in the Python script and then save the model weights, etc. Then you can load that data into a Jupyter notebook to, say, look at the predictions. I was going to give this another shot soon, though. I am watching some videos where someone is using Jupyter notebooks for CNNs. - krishnab, Jun 02 '18 at 23:35
I had a similar issue on Mac. Found the solution in this post: https://stackoverflow.com/questions/53014306/error-15-initializing-libiomp5-dylib-but-found-libiomp5-dylib-already-initial - Aftab, Dec 27 '19 at 17:43
