TensorFlow first epoch is extremely slow (maybe related to pool_allocator)

Question

I am training a model built with TF. The first epoch is about 100 times slower than the following epochs, and I am seeing messages like:

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 958 to 1053

As suggested here, I tried to use tcmalloc by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", but it didn't help.

Any idea on how to make the first epoch run faster?

Do you set any options so that TF does not gulp all GPU memory? — Siyuan Ren, Jan 05 '18 at 07:26

score 1 · Answer 1 · answered Nov 14 '19 at 13:55

This seems to be a hardware-level issue. During the first epoch, TF (like other DL libraries such as PyTorch, as discussed here) caches information about the data, as discussed here by @ppwwyyxx.

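If each input has a different size, TF can spend a large amount of time running cuDNN benchmarks for every new size it encounters and storing the results in a cache. Later epochs reuse the cached results, which is why they run much faster.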
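
If varying input sizes are indeed the cause, one thing to try is reducing the number of distinct shapes the model sees, for example by padding each input up to one of a few fixed lengths. The following is only a minimal sketch in plain Python, not something taken from the linked discussions: the bucket boundaries and the helper names `bucket_length` and `pad_to_bucket` are hypothetical, and whether this helps depends on your model and data.

```python
import bisect

# Hypothetical bucket boundaries; choose values that suit your data.
BUCKETS = [32, 64, 128, 256]


def bucket_length(length, buckets=BUCKETS):
    """Return the smallest bucket boundary that is >= length."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError("length %d exceeds the largest bucket" % length)
    return buckets[i]


def pad_to_bucket(seq, pad_value=0, buckets=BUCKETS):
    """Pad seq on the right so its length equals its bucket boundary."""
    target = bucket_length(len(seq), buckets)
    return list(seq) + [pad_value] * (target - len(seq))


# Toy data: eight samples, each with a different length.
lengths = [5, 17, 33, 40, 64, 100, 130, 200]
samples = [[1] * n for n in lengths]
padded = [pad_to_bucket(s) for s in samples]

# Count how many distinct input sizes exist before and after padding.
distinct_before = len({len(s) for s in samples})
distinct_after = len({len(p) for p in padded})
print(distinct_before, distinct_after)
```

With fewer distinct shapes, there are fewer sizes for cuDNN to benchmark during the first epoch. TensorFlow also reads the `TF_CUDNN_USE_AUTOTUNE` environment variable; setting it to `0` before the process starts is another thing to try, though it may make the later epochs slower.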
