
I am using GloVe pre-trained embeddings to train my own network. I use

self.embedding = tf.get_variable(name="embedding", shape=self.id2vec_table.shape, initializer=tf.constant_initializer(self.id2vec_table), trainable=False)

and tuning_embedding = tf.nn.embedding_lookup(self.embedding, self.txt_from_mfcc)
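For context, `tf.nn.embedding_lookup` essentially gathers rows of the embedding matrix by index. A minimal NumPy sketch of that behaviour (with made-up sizes, not the real ~400k-row GloVe table):

```python
import numpy as np

# Hypothetical sizes for illustration; the real id2vec_table here
# is roughly 400001 rows (vocabulary) by embedding_dim columns.
vocab_size, embed_dim = 10, 4
id2vec_table = np.random.rand(vocab_size, embed_dim).astype(np.float32)

# tf.nn.embedding_lookup(embedding, ids) is effectively a row gather:
ids = np.array([2, 5, 2])      # token ids produced upstream
looked_up = id2vec_table[ids]  # shape (3, embed_dim)

print(looked_up.shape)  # (3, 4)
```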

to initialize and look up the embedding. However, when I run training, the error shows as follows (the full message is very long; I include here the parts I believe are most important):

Sum Total of in-use chunks: 3.85GiB
Limit:        11281927373
InUse:         4131524096
MaxInUse:      6826330624
NumAllocs:          47061
MaxAllocSize:  2842165248

OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[4800,400001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

However, according to the error stats, my Tesla K80 has 11 GB of memory, and only 40%-70% of it (around 4-7 GB) is in use. How can my GPU be out of memory when at most 70% of the total is used? I just cannot understand the inner mechanism of how this works.
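One way to sanity-check the numbers in the log (assuming the tensor is float32, i.e. 4 bytes per element): the single tensor of shape [4800, 400001] that the failing matmul tries to allocate is by itself larger than the memory left under the allocator's limit, so the allocation can fail even though current usage looks moderate.

```python
# Size of the tensor the failing matmul tries to allocate,
# assuming float32 (4 bytes per element).
GiB = 1024 ** 3
tensor_bytes = 4800 * 400001 * 4
print(tensor_bytes / GiB)          # ~7.15 GiB for this single tensor

# Numbers reported by the allocator in the error message:
limit_bytes = 11281927373          # allocator limit (~10.5 GiB)
in_use_bytes = 4131524096          # already in use (~3.85 GiB)

# The new tensor does not fit in what remains:
free_bytes = limit_bytes - in_use_bytes
print(tensor_bytes > free_bytes)   # True -> OOM despite "only ~40% used"
```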

I have also tried methods from other posts, such as https://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu: limiting my batch size to 16, setting config.gpu_options.allow_growth = True, config.gpu_options.allocator_type = 'BFC', and config.gpu_options.per_process_gpu_memory_fraction = 0.4, but the error is still there.
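For reference, a sketch of how I combined these session options (TF 1.x `ConfigProto` API, values as mentioned above):

```python
import tensorflow as tf

# Session config combining the memory-related options I tried (TF 1.x).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                   # allocate GPU memory lazily
config.gpu_options.allocator_type = 'BFC'                # best-fit-with-coalescing allocator
config.gpu_options.per_process_gpu_memory_fraction = 0.4 # cap at 40% of GPU memory

sess = tf.Session(config=config)
```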

Any help here?

