
I am training a model built with TF. During the first epoch, TF is slower than the following epochs by a factor of ~100, and I am seeing messages like:

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 958 to 1053

As suggested here, I tried to use tcmalloc by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", but it didn't help.

Any idea on how to make the first epoch run faster?

Yuval Atzmon

1 Answer


It seems to be a hardware-related issue. During the first epoch, TF (like other DL libraries, e.g. PyTorch, as discussed here) caches information about the data, as discussed here by @ppwwyyxx:

If each input has a different size, TF can spend a large amount of time running cuDNN benchmarks for each shape and storing the results in a cache
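One common workaround is to pad variable-sized inputs to a single fixed shape, so cuDNN's per-shape benchmarking runs only once instead of once per distinct size. Below is a minimal NumPy sketch of that idea; the function name and target sizes are illustrative, not part of any TF API:

```python
import numpy as np

def pad_to_fixed(batch, target_h, target_w):
    """Zero-pad each (H, W, C) image up to a fixed (target_h, target_w, C) shape.

    With a single fixed input shape, cuDNN's per-shape benchmarking runs
    once, so later batches and epochs hit the cached algorithm choice.
    (Hypothetical helper for illustration.)"""
    out = []
    for img in batch:
        h, w, c = img.shape
        padded = np.zeros((target_h, target_w, c), dtype=img.dtype)
        padded[:h, :w, :] = img
        out.append(padded)
    return np.stack(out)

# Two differently sized images collapse to one batch shape:
batch = [np.ones((3, 4, 1)), np.ones((5, 2, 1))]
fixed = pad_to_fixed(batch, 5, 4)
# fixed.shape == (2, 5, 4, 1)
```

Alternatively, if the per-shape benchmarking itself is the bottleneck, TensorFlow can skip it entirely by setting the environment variable TF_CUDNN_USE_AUTOTUNE=0, at the cost of possibly slower per-step kernels.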

Eugene