Say I have access to a number of GPUs in a single machine (for the sake of argument assume 8 GPUs, each with 8GB of memory max, in one single machine with some amount of RAM and disk). I want to run, in one single script and on one single machine, a program that evaluates multiple models (say 50 or 200) in TensorFlow, each with a different hyperparameter setting (say, step size, decay rate, batch size, epochs/iterations, etc.). At the end of training, assume we just record its accuracy and get rid of the model (if you want, assume the model is being checkpointed every so often, so it's fine to just throw away the model and start training from scratch. You may also assume some other data may be recorded, like the specific hyperparameters and the train, validation, and test errors as we train, etc.).
Currently I have a (pseudo-)script that looks as follows:
def train_multiple_models_in_one_script_with_gpu(arg):
    '''
    Trains multiple NN models in one script using GPUs, one model at a time.
    arg = some obj/struct with the params for training each of the models.
    '''
    #### try multiple models
    for mdl_id in range(100):
        #### define/create a fresh graph for this model
        graph = tf.Graph()
        with graph.as_default():
            ### get mdl
            x = tf.placeholder(float_type, get_x_shape(arg), name='x-input')
            y_ = tf.placeholder(float_type, get_y_shape(arg))
            y = get_mdl(arg, x)
            ### get loss and accuracy
            loss, accuracy = get_accuracy_loss(arg, x, y, y_)
            ### get optimizer variables
            global_step = tf.Variable(0, trainable=False, name='global_step')
            opt = get_optimizer(arg)
            train_step = opt.minimize(loss, global_step=global_step)
        #### run session (each model gets its own session on its own graph)
        with tf.Session(graph=graph) as sess:
            sess.run(tf.global_variables_initializer())
            # train
            for i in range(nb_iterations):
                batch_xs, batch_ys = get_batch_feed(X_train, Y_train, batch_size)
                sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
                # check_point mdl and report errors every so often
                if i % report_error_freq == 0:
                    sess.run(global_step.assign(i))
                    train_error = sess.run(fetches=loss, feed_dict={x: X_train, y_: Y_train})
                    test_error = sess.run(fetches=loss, feed_dict={x: X_test, y_: Y_test})
                    print('step %d, train error: %s test_error %s' % (i, train_error, test_error))
Essentially, it tries lots of models in one single run, but it builds each model in a separate graph and runs each one in a separate session.
I guess my main worry is that it's unclear to me how TensorFlow allocates GPU resources under the hood. For example, does it load (part of) the data set only when a session is run? When I create a graph and a model, is it brought into the GPU immediately, or when is it inserted into the GPU? Do I need to clear/free the GPU each time it tries a new model? I don't actually care too much whether the models run in parallel on multiple GPUs (which would be a nice addition), but I first want everything to run serially without crashing. Is there anything special I need to do for this to work?
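For what it's worth, the only GPU-memory knobs I have found so far are the session config options. A minimal sketch of what I mean (I am not at all sure this is the right fix; allow_growth and per_process_gpu_memory_fraction are just the options I found in the docs):

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand rather than grabbing almost
# all of it up front, and/or cap the fraction of each GPU a process may claim.
gpu_options = tf.GPUOptions(allow_growth=True, per_process_gpu_memory_fraction=0.9)
config = tf.ConfigProto(gpu_options=gpu_options)

graph = tf.Graph()
with graph.as_default():
    ...  # build the model exactly as in the loop above
with tf.Session(graph=graph, config=config) as sess:
    ...  # train exactly as above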
Currently I am getting an error that starts as follows:
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 340000768
InUse: 336114944
MaxInUse: 339954944
NumAllocs: 78
MaxAllocSize: 335665152
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************************xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 160.22MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[60000,700]
and further down the line it says:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[60000,700]
[[Node: standardNN/NNLayer1/Z1/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](standardNN/NNLayer1/Z1/MatMul, b1/read)]]
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
However, further down the output file (where it prints), it seems to print fine the errors/messages that should show as training proceeds. Does this mean that it didn't run out of resources? Or was it actually able to use the GPU? If it was able to use the CPU instead of the GPU, then why is this error only happening when the GPU is about to be used?
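If it helps, I guess I could check which device each op actually landed on by turning on device placement logging; a tiny sketch (assuming this config option does what I think it does):

import tensorflow as tf

# Log, for every op, which device it was placed on, so I can see whether the
# model really ran on /gpu:0 or silently fell back to the CPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0], name='a')
    print(sess.run(a))  # the session log should show the device of each op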
The weird thing is that the data set is really not that big (all 60K points take about 24.5MB), and when I run a single model locally on my own computer, the process seems to use less than 5GB. The GPUs have at least 8GB each, and the computer with them has plenty of RAM and disk (at least 16GB). Thus, the errors that TensorFlow is throwing at me are quite puzzling. What is it trying to do, and why are they occurring? Any ideas?
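The only concrete lead I have is that the OOM shape [60000,700] matches a forward pass over the entire 60K-point training set, which my reporting step does with sess.run(loss, feed_dict={x: X_train, y_: Y_train}). If that full-batch evaluation is the culprit, I suppose I could evaluate the loss in chunks instead; a sketch, assuming the same x, y_ placeholders and loss tensor as above, and that loss is a mean over the batch:

def eval_loss_in_chunks(sess, loss, x, y_, X, Y, chunk_size=1000):
    '''Average the loss over mini-batches so a full [60000, 700] activation
    tensor never has to live on the GPU all at once.'''
    total, n = 0.0, X.shape[0]
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        batch_loss = sess.run(loss, feed_dict={x: X[start:end], y_: Y[start:end]})
        total += batch_loss * (end - start)  # re-weight each chunk by its size
    return total / n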
After reading the answer that suggests using the multiprocessing library, I came up with the following script:
from multiprocessing import Process

def train_mdl(args):
    train(args)  # trains one model with the given hyperparams

if __name__ == '__main__':
    for mdl_id in range(100):
        # train one model with some specific hyperparams (assume they are chosen
        # randomly inside the function above, or read from a config file, or they
        # could just be passed in, or something)
        p = Process(target=train_mdl, args=(args,))
        p.start()
        p.join()  # wait for this model's process to finish before starting the next
    print('Done training all models!')
Honestly, I am not sure why his answer suggests using a Pool, or why there are weird tuple brackets in args=(args,), but this is what would make sense to me. Would the resources for TensorFlow be re-allocated every time a new process is created in the above loop?
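In case it clarifies what I mean, here is how I read the Pool suggestion (just a sketch; make_args is a hypothetical helper that builds one hyperparameter struct per model, and my understanding, which may be wrong, is that each worker process gets its own fresh CUDA context that is released when the process exits):

from multiprocessing import Pool

if __name__ == '__main__':
    hyperparam_list = [make_args(mdl_id) for mdl_id in range(100)]  # make_args is hypothetical
    # processes=1 keeps the runs serial; maxtasksperchild=1 (as I understand it)
    # forces a brand-new process per model, so the GPU memory TensorFlow grabbed
    # is fully released after each training run.
    with Pool(processes=1, maxtasksperchild=1) as pool:
        accuracies = pool.map(train_mdl, hyperparam_list)
    print('Done training all models!')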