
I have several GPUs but I only want to use one GPU for my training. I am using the following options:

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:

Despite setting/using all these options, memory is allocated on all of my GPUs and

#processes = #GPUs

How can I prevent this from happening?

Note

  1. I do not want to set the devices manually and I do not want to set CUDA_VISIBLE_DEVICES, since I want tensorflow to automatically find the best (i.e. an idle) GPU available
  2. When I try to start another run, it uses the same GPU that is already used by another tensorflow process, even though there are several other free GPUs (apart from the memory allocated on them)
  3. I am running tensorflow in a docker container: tensorflow/tensorflow:latest-devel-gpu-py
  • That seems very weird. Could you please try to post the full code and the TF version you're using? – Matan Hugi Dec 20 '17 at 21:01
  • Did you try to specify an initial memory fraction? `gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333) sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))` (expanded in the sketch after these comments) https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory – y.selivonchyk Dec 20 '17 at 21:26
  • The full code spans more than 5 scripts, so unfortunately I cannot post all of it, but I think I made my point clear? Or is there anything specific you would like to see? I have added the tensorflow version I am working with. @MatanHugi –  Dec 21 '17 at 07:27
  • No, I have not, but I am sure this will not help me with my problem. @yauheni_selivonchyk –  Dec 21 '17 at 07:29
  • Tensorflow has no logic to find the best (idle) GPU available. – Alexandre Passos Dec 21 '17 at 21:04
  • Hm, okay, that's bad, but at least it shouldn't allocate memory on every single GPU, nor should it start as many processes as there are GPUs –  Dec 22 '17 at 07:46
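For reference, the memory-fraction suggestion from the comments looks like this in context (a minimal sketch assuming the TF 1.x session API; the 0.333 fraction is just the example value from the comment, and note that it caps memory on every visible GPU rather than selecting a single one):

import tensorflow as tf

# Limit this process to roughly a third of the memory on every visible GPU
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)

with tf.Session(config=config) as sess:
    ...  # training code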

2 Answers


I had this problem myself. Setting config.gpu_options.allow_growth = True did not do the trick, and all of the GPU memory was still consumed by Tensorflow. The way around it is the undocumented environment variable TF_FORCE_GPU_ALLOW_GROWTH (I found it in https://github.com/tensorflow/tensorflow/blob/3e21fe5faedab3a8258d344c8ad1cec2612a8aa8/tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc#L25).

Setting TF_FORCE_GPU_ALLOW_GROWTH=true works perfectly.

In the Python code, you can set

os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
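For context, a minimal sketch of how this is typically wired up (an assumption on my part: the variable has to be set before TensorFlow initializes its GPU devices, so it is set before the import; TF 1.x session API shown):

import os

# Must be set before TensorFlow touches the GPUs, i.e. before the first session is created
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf

with tf.Session() as sess:
    ...  # GPU memory now grows on demand instead of being fully pre-allocated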
    Great! But note that if you are using SLURM, you will need to add "export TF_FORCE_GPU_ALLOW_GROWTH=true" to your sbatch script, because you won't be able to set environment variables from within python! – midawn98 Apr 28 '20 at 21:07

I can offer you a method mask_busy_gpus defined here: https://github.com/yselivonchyk/TensorFlow_DCIGN/blob/master/utils.py

A simplified version of the function:

import subprocess as sp
import os

def mask_unused_gpus(leave_unmasked=1):
  # Hide all but `leave_unmasked` idle GPUs from TensorFlow via CUDA_VISIBLE_DEVICES.
  ACCEPTABLE_AVAILABLE_MEMORY = 1024  # MiB of free memory required to treat a GPU as idle
  COMMAND = "nvidia-smi --query-gpu=memory.free --format=csv"

  try:
    # Query free memory per GPU; drop the CSV header and the trailing empty line
    _output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
    memory_free_info = _output_to_list(sp.check_output(COMMAND.split()))[1:]
    memory_free_values = [int(x.split()[0]) for x in memory_free_info]
    available_gpus = [i for i, x in enumerate(memory_free_values) if x > ACCEPTABLE_AVAILABLE_MEMORY]

    if len(available_gpus) < leave_unmasked:
      raise ValueError('Found only %d usable GPUs in the system' % len(available_gpus))
    # Expose only the first `leave_unmasked` idle GPUs to this process
    os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, available_gpus[:leave_unmasked]))
  except Exception as e:
    print('"nvidia-smi" is probably not installed. GPUs are not masked', e)

Usage:

mask_unused_gpus()
with tf.Session() as sess:
  ...  # only the unmasked GPU(s) are visible to TensorFlow

Prerequisites: nvidia-smi

With this script I was solving the following problem: on a multi-GPU cluster, use only a single (or an arbitrary) number of GPUs and let them be allocated automatically.

Shortcoming of the script: if you start multiple scripts at once, they might be assigned the same GPU, because the script relies on memory allocation as the idleness signal and memory allocation takes a few seconds to kick in (a possible mitigation is sketched below).
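One possible mitigation (my own sketch, not part of the linked utils.py; mask_unused_gpus_random is a hypothetical name): pick the GPUs to keep at random from the idle ones, so that jobs launched at the same moment are less likely to collide on the same device.

import os
import random
import subprocess as sp

def mask_unused_gpus_random(leave_unmasked=1, acceptable_free_mb=1024):
  # Hypothetical variant of mask_unused_gpus: choose random idle GPUs
  # instead of always the first ones, to reduce collisions between jobs
  # that start before each other's memory allocation has kicked in.
  command = "nvidia-smi --query-gpu=memory.free --format=csv"
  lines = sp.check_output(command.split()).decode('ascii').split('\n')[1:-1]
  free_mb = [int(line.split()[0]) for line in lines]
  available_gpus = [i for i, mb in enumerate(free_mb) if mb > acceptable_free_mb]

  if len(available_gpus) < leave_unmasked:
    raise ValueError('Found only %d usable GPUs in the system' % len(available_gpus))
  chosen = random.sample(available_gpus, leave_unmasked)
  os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, chosen))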
