
Hi, I have a question about the PoolAllocator. When I start my training job, it appears to spend several hours in the "PoolAllocator" stage. Some of the logs are shown below. Is there a way to debug or profile the cause, and how can I improve it? Thanks!

tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 9639 get requests, put_count=4341 evicted_count=1000 eviction_rate=0.230362 and unsatisfied allocation rate=0.663762
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=2013 evicted_count=2000 eviction_rate=0.993542 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 7080 get requests, put_count=6922 evicted_count=5000 eviction_rate=0.722335 and unsatisfied allocation rate=0.730791
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 176 to 193
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=2025 evicted_count=2000 eviction_rate=0.987654 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=5030 evicted_count=5000 eviction_rate=0.994036 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=2044 evicted_count=2000 eviction_rate=0.978474 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 9617 get requests, put_count=8892 evicted_count=5000 eviction_rate=0.562303 and unsatisfied allocation rate=0.600915
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 596 to 655


I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=2087 evicted_count=2000 eviction_rate=0.958313 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=3095 evicted_count=3000 eviction_rate=0.969305 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=1115 evicted_count=1000 eviction_rate=0.896861 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=1140 evicted_count=1000 eviction_rate=0.877193 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=1169 evicted_count=1000 eviction_rate=0.855432 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=1204 evicted_count=1000 eviction_rate=0.830565 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=2247 evicted_count=2000 eviction_rate=0.890076 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=8272 evicted_count=8000 eviction_rate=0.967118 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=2362 evicted_count=2000 eviction_rate=0.84674 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 10614 get requests, put_count=10944 evicted_count=2000 eviction_rate=0.182749 and unsatisfied allocation rate=0.198606
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 4823 to 5305
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=3705 evicted_count=3000 eviction_rate=0.809717 and unsatisfied allocation rate=0
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 4204990 get requests, put_count=4204742 evicted_count=3000 eviction_rate=0.00071348 and unsatisfied allocation rate=0.00104257
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 16377314 get requests, put_count=16374197 evicted_count=13000 eviction_rate=0.000793932 and unsatisfied allocation rate=0.00105347
psh

2 Answers


Actually, the job is working fine. The problem is Python's output buffering: no training results are shown until the job finishes. You can disable the buffering with:

import os, sys
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)  # Python 2 only: reopen stdout with buffer size 0

Alternatively, you can try the other methods described in the question "Disable output buffering".
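
On Python 3 the os.fdopen(..., 'w', 0) call raises a ValueError, because text streams cannot be unbuffered. A minimal sketch of an equivalent, assuming Python 3.7 or later, might look like the following. The helper name log_progress and the message format are hypothetical and only there for illustration.

```python
import sys


def log_progress(step, loss, stream=None):
    """Write one progress line and flush it immediately.

    `log_progress` is a hypothetical helper, not part of TensorFlow.
    """
    if stream is None:
        stream = sys.stdout
    line = "step {} loss {:.4f}".format(step, loss)
    # flush=True pushes the line out even when stdout is redirected to a file
    print(line, file=stream, flush=True)
    return line


if __name__ == "__main__":
    # Optional: make stdout line-buffered for every print (Python 3.7+)
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(line_buffering=True)
    log_progress(1, 0.6931)
```

If you would rather not change the script, starting it with python -u or setting the PYTHONUNBUFFERED environment variable should have a similar effect.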

Yan Zhang

The job is working fine. These messages should only be a cause for concern if you run out of memory.

The answers to this question may help you: "How to interpret Poolallocator messages in tensorflow?"
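
If you want to check that the messages are settling down rather than reading them by eye, the log lines in the question can be summarized with a short script. This is only a sketch: the function name parse_pool_log and the regular expressions are assumptions based on the message format shown in the question, not something TensorFlow provides.

```python
import re

# Patterns inferred from the log lines posted in the question (an assumption,
# not an official format specification).
POOL_RE = re.compile(
    r"After (\d+) get requests, put_count=(\d+) evicted_count=(\d+) "
    r"eviction_rate=([0-9.eE+-]+) and unsatisfied allocation rate=([0-9.eE+-]+)"
)
LIMIT_RE = re.compile(r"Raising pool_size_limit_ from (\d+) to (\d+)")


def parse_pool_log(lines):
    """Return (stats, limits) extracted from PoolAllocator log lines."""
    stats, limits = [], []
    for line in lines:
        m = POOL_RE.search(line)
        if m:
            stats.append({
                "get_requests": int(m.group(1)),
                "put_count": int(m.group(2)),
                "evicted_count": int(m.group(3)),
                "eviction_rate": float(m.group(4)),
                "unsatisfied_rate": float(m.group(5)),
            })
            continue
        m = LIMIT_RE.search(line)
        if m:
            # (old limit, new limit) each time the pool is enlarged
            limits.append((int(m.group(1)), int(m.group(2))))
    return stats, limits


if __name__ == "__main__":
    sample = [
        "pool_allocator.cc:244] PoolAllocator: After 9639 get requests, "
        "put_count=4341 evicted_count=1000 eviction_rate=0.230362 "
        "and unsatisfied allocation rate=0.663762",
        "pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110",
    ]
    stats, limits = parse_pool_log(sample)
    print(stats[-1]["eviction_rate"], limits)
```

In the posted log, eviction_rate matches evicted_count divided by put_count (for example 1000 / 4341 ≈ 0.230362), and both rates fall to about 0.001 by the last two lines, which is consistent with the pool having grown large enough.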

hankaixyz