I am using GloVe pre-trained embeddings to train my own network. I initialize the embedding and look it up with:

self.embedding = tf.get_variable(name="embedding", shape=self.id2vec_table.shape, initializer=tf.constant_initializer(self.id2vec_table), trainable=False)
tuning_embedding = tf.nn.embedding_lookup(self.embedding, self.txt_from_mfcc)
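For context, here is a simplified, self-contained version of that setup (the .npy path and placeholder shapes are just illustrative; in my real code id2vec_table is built from the GloVe file):

import numpy as np
import tensorflow as tf

# id2vec_table: the GloVe matrix, shape [vocab_size, embed_dim].
# "glove_id2vec.npy" is an illustrative path for this sketch.
id2vec_table = np.load("glove_id2vec.npy")

embedding = tf.get_variable(
    name="embedding",
    shape=id2vec_table.shape,
    initializer=tf.constant_initializer(id2vec_table),
    trainable=False)  # keep the pre-trained GloVe vectors frozen

# txt_from_mfcc holds integer word ids; shapes here are illustrative.
txt_from_mfcc = tf.placeholder(tf.int32, shape=[None, None])
tuning_embedding = tf.nn.embedding_lookup(embedding, txt_from_mfcc)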
to initialize and look up embedding. However when I did the training, the error shows as (the error message is too long, I add here the most important ones I believe)
Sum Total of in-use chunks: 3.85GiB
Limit:        11281927373
InUse:         4131524096
MaxInUse:      6826330624
NumAllocs:          47061
MaxAllocSize:  2842165248

OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[4800,400001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
According to these stats, my Tesla K80 has a limit of about 11 GB, and only about 4 to 7 GB (40% to 70%) of it is in use. How can the GPU be out of memory when at most 70% of its total memory is used? I cannot work out the inner mechanism of how the allocator behaves here.
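For scale, I did the arithmetic on the failing tensor itself (assuming float32, and guessing that 400001 is my GloVe vocabulary size plus one), and that single allocation is already several GiB on its own:

# The single tensor the allocator failed on:
# shape [4800, 400001], float32 (4 bytes per element).
size_bytes = 4800 * 400001 * 4
print(size_bytes / 2**30)  # ~7.15 GiB for this one tensor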
I have also tried the fixes suggested in other posts such as https://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu: limiting my batch size to 16, setting config.gpu_options.allow_growth = True, config.gpu_options.allocator_type = 'BFC', and config.gpu_options.per_process_gpu_memory_fraction = 0.4. The error is still there.
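For completeness, this is roughly how I combined those options (a sketch; the training loop itself is unchanged and omitted):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True        # allocate GPU memory on demand
config.gpu_options.allocator_type = 'BFC'     # best-fit-with-coalescing allocator
config.gpu_options.per_process_gpu_memory_fraction = 0.4  # cap the process at 40% of GPU memory

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop (omitted) ...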
Any help would be appreciated.