
I am using GloVe pre-trained embeddings to train my own network. I use

self.embedding = tf.get_variable(name="embedding", shape=self.id2vec_table.shape, initializer=tf.constant_initializer(self.id2vec_table), trainable=False)

and tuning_embedding = tf.nn.embedding_lookup(self.embedding, self.txt_from_mfcc)
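For context, `tf.nn.embedding_lookup` essentially gathers rows of the embedding matrix by index. A minimal NumPy sketch of that behaviour (with made-up sizes, not the real ~400k-row GloVe table):

```python
import numpy as np

# Hypothetical sizes for illustration; the real id2vec_table here
# is roughly 400001 rows (vocabulary) by embedding_dim columns.
vocab_size, embed_dim = 10, 4
id2vec_table = np.random.rand(vocab_size, embed_dim).astype(np.float32)

# tf.nn.embedding_lookup(embedding, ids) is effectively a row gather:
ids = np.array([2, 5, 2])      # token ids produced upstream
looked_up = id2vec_table[ids]  # shape (3, embed_dim)

print(looked_up.shape)  # (3, 4)
```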

to initialize and look up the embedding. However, when I run training, the error shows as follows (the full message is very long; I include here the parts I believe are most important):

Sum Total of in-use chunks: 3.85GiB
Limit:        11281927373
InUse:         4131524096
MaxInUse:      6826330624
NumAllocs:          47061
MaxAllocSize:  2842165248

OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[4800,400001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

However, according to the error stats, my Tesla K80 has 11 GB of memory, and only 40%-70% of it (around 4-7 GB) is in use. How can my GPU be out of memory when at most 70% of the total is used? I just cannot understand the inner mechanism of how this works.
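One way to sanity-check the numbers in the log (assuming the tensor is float32, i.e. 4 bytes per element): the single tensor of shape [4800, 400001] that the failing matmul tries to allocate is by itself larger than the memory left under the allocator's limit, so the allocation can fail even though current usage looks moderate.

```python
# Size of the tensor the failing matmul tries to allocate,
# assuming float32 (4 bytes per element).
GiB = 1024 ** 3
tensor_bytes = 4800 * 400001 * 4
print(tensor_bytes / GiB)          # ~7.15 GiB for this single tensor

# Numbers reported by the allocator in the error message:
limit_bytes = 11281927373          # allocator limit (~10.5 GiB)
in_use_bytes = 4131524096          # already in use (~3.85 GiB)

# The new tensor does not fit in what remains:
free_bytes = limit_bytes - in_use_bytes
print(tensor_bytes > free_bytes)   # True -> OOM despite "only ~40% used"
```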

I have also tried methods from other posts, such as https://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu: limiting my batch size to 16, setting config.gpu_options.allow_growth = True, config.gpu_options.allocator_type = 'BFC', and config.gpu_options.per_process_gpu_memory_fraction = 0.4, but the error is still there.
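For reference, a sketch of how I combined these session options (TF 1.x `ConfigProto` API, values as mentioned above):

```python
import tensorflow as tf

# Session config combining the memory-related options I tried (TF 1.x).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                   # allocate GPU memory lazily
config.gpu_options.allocator_type = 'BFC'                # best-fit-with-coalescing allocator
config.gpu_options.per_process_gpu_memory_fraction = 0.4 # cap at 40% of GPU memory

sess = tf.Session(config=config)
```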

Any help here?

