ResourceExhaustedError using cnn

Question

While trying to run my 3D Convolutional Neural Network, I am getting the following error. What could be the reason ?

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[54080,1024] [[Node: Variable_10/Adam/Assign = Assign[T=DT_FLOAT, _class=["loc:@Variable_10"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_10/Adam, zeros_4)]]

This is the code i have used:

import tensorflow as tf
import numpy as np

IMG_SIZE_PX = 50
SLICE_COUNT = 20

n_classes = 2
batch_size = 10

x = tf.placeholder('float')
y = tf.placeholder('float')

keep_rate = 0.8
def conv3d(x, W):
    return tf.nn.conv3d(x, W, strides=[1,1,1,1,1], padding='SAME')

def maxpool3d(x):
    return tf.nn.max_pool3d(x, ksize=[1,2,2,2,1], strides=[1,2,2,2,1], padding='SAME')

def convolutional_neural_network(x):

    weights = {'W_conv1':tf.Variable(tf.random_normal([3,3,3,1,32])),

               'W_conv2':tf.Variable(tf.random_normal([3,3,3,32,64])),

               'W_fc':tf.Variable(tf.random_normal([54080,1024])),
               'out':tf.Variable(tf.random_normal([1024, n_classes]))}

    biases = {'b_conv1':tf.Variable(tf.random_normal([32])),
               'b_conv2':tf.Variable(tf.random_normal([64])),
               'b_fc':tf.Variable(tf.random_normal([1024])),
               'out':tf.Variable(tf.random_normal([n_classes]))}


    x = tf.reshape(x, shape=[-1, IMG_SIZE_PX, IMG_SIZE_PX, SLICE_COUNT, 1])

    conv1 = tf.nn.relu(conv3d(x, weights['W_conv1']) + biases['b_conv1'])
    conv1 = maxpool3d(conv1)


    conv2 = tf.nn.relu(conv3d(conv1, weights['W_conv2']) + biases['b_conv2'])
    conv2 = maxpool3d(conv2)

    fc = tf.reshape(conv2,[-1, 54080])
    fc = tf.nn.relu(tf.matmul(fc, weights['W_fc'])+biases['b_fc'])
    fc = tf.nn.dropout(fc, keep_rate)

    output = tf.matmul(fc, weights['out'])+biases['out']

    return output

much_data = np.load('muchdata-50-50-20.npy')
# If you are working with the basic sample data, use maybe 2 instead of 100 here... you don't have enough data to really do this
train_data = much_data[:-100]
validation_data = much_data[-100:]


def train_neural_network(x):
    prediction = convolutional_neural_network(x)
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction,labels=y) )
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(cost)

    hm_epochs = 10
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        successful_runs = 0
        total_runs = 0

        for epoch in range(hm_epochs):
            epoch_loss = 0
            for data in train_data:
                total_runs += 1
                try:
                    X = data[0]
                    Y = data[1]
                    _, c = sess.run([optimizer, cost], feed_dict={x: X, y: Y})
                    epoch_loss += c
                    successful_runs += 1
                except Exception as e:
                    # I am passing for the sake of notebook space, but we are getting 1 shaping issue from one 
                    # input tensor. Not sure why, will have to look into it. Guessing it's
                    # one of the depths that doesn't come to 20.
                    pass
                    #print(str(e))

            print('Epoch', epoch+1, 'completed out of',hm_epochs,'loss:',epoch_loss)

            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

            print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))

        print('Done. Finishing accuracy:')
        print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))

        print('fitment percent:',successful_runs/total_runs)

train_neural_network(x)

I am running this using tensorflow-gpu version. I am using a GTX970M and have installed CUDA and imported the cudnn files properly. When running the last command i get the following error. Kindly help !

how much memory does your GTX970M have ? – Jason Apr 14 '17 at 09:14 — Jason, Apr 14 '17 at 09:14

score 0 · Accepted Answer · edited May 23 '17 at 12:32

0

You have ran out of memory for some reasons. It can be that you have some application using your GPU (another tensorflow session still active for example).Check if that's not the case. (you can use nvidia-smi to monitor that).

If that's not, it will mainly be because of the size of your model and the size of your GPU's memory. What you can do is try to launch it in CPU mode, list all your variables with tf.Variables, do the math of how much memory it represents, and see if it fit in your GPU.

Until you have done this, there not much more advice I can provide.

edited May 23 '17 at 12:32

Community

1
1

answered Apr 14 '17 at 09:35

Arnaud De Broissia

659
3
10

Yes, the problem was that tensorflow of the previous session was still using my GPU even if the process had been stopped. I had to close the cmd and try again for it to start using the GPU. – Ashwanth Ramji Apr 17 '17 at 23:08

ResourceExhaustedError using cnn

1 Answers1