
Long shot: Ubuntu 16.04, Nvidia GTX 1070 with 8 GB on board. The machine has 64 GB of RAM, the dataset is 1 million records, and I'm running current CUDA and cuDNN libraries with TensorFlow 1.0 on Python 3.6.

I'm not sure how to troubleshoot this.

I have been working on getting some models up with TensorFlow and have run into this phenomenon a number of times. I don't know of anything other than TensorFlow that could be using the GPU memory:

```
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.56GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cu
```
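For what it's worth, the first failure looks consistent with the numbers in the log itself: the initial allocation asks for the full 7.92GiB (8499298304 bytes) while only 7.56GiB is reported free. A quick check of that arithmetic:

```python
# Figures taken directly from the log above
total_gib = 8499298304 / 2**30  # "Total memory: 7.92GiB"
free_gib = 7.56                 # "Free memory: 7.56GiB"

# The first allocation requests the full device memory,
# which exceeds what is actually free.
print(round(total_gib, 2))      # ~7.92
print(total_gib > free_gib)     # True: the 7.92G allocation cannot succeed
```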

Then I get the following, which indicates that some sort of memory allocation is going on, yet it still fails:

```
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 899200000 totalling 4.19GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1649756928 totalling 1.54GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.40GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                  8499298304
InUse:                  6875780608
MaxInUse:               6878976000
NumAllocs:                     338
MaxAllocSize:           1649756928

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ******************************************************************************************xxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 6.10MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Internal: Dst tensor is not initialized.
     [[Node: linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice/_1055 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1643_linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
```

**Update:** I reduced the record count from millions to 40,000 and got a base model to run to completion. I still get an error message, but not continuous ones. The model output includes text suggesting I restructure the model, and I suspect the data structure is a big part of the problem. I could still use some better hints on how to debug the entire process. Below is the remaining console output:

```
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.52GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
[I 09:13:09.297 NotebookApp] Saving file at /Documents/InfluenceH/Working_copies/Cond_fcast_wkg/TensorFlow+DNNLinearCombinedClassifier+for+Influence.ipynb
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
```
dartdog
  • This unanswered question is quite similar: http://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu?rq=1 – dartdog Apr 22 '17 at 22:51

1 Answer

I think the problem is that TensorFlow tries to allocate 7.92GiB of GPU memory while only 7.56GiB is actually free. I cannot tell you why the rest of the GPU memory is occupied, but you may be able to avoid this problem by limiting the fraction of GPU memory your program is allowed to allocate:

```python
# Cap the per-process GPU memory allocation at 90% of the device total
sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9
with tf.Session(config=sess_config, ...) as ...:
```

With this, the program will only allocate 90 percent of the GPU memory, i.e. 7.13GB.
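As a sanity check of that arithmetic (the 8499298304-byte total comes from the log in the question; the 0.9 fraction is the value set in the snippet above):

```python
total_bytes = 8499298304  # "Total memory" reported by TensorFlow in the log
fraction = 0.9            # per_process_gpu_memory_fraction from the snippet

allowed_gib = total_bytes * fraction / 2**30  # bytes -> GiB

print(round(allowed_gib, 2))  # ~7.12 GiB, i.e. the ~7.13GB figure above
```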

ml4294
  • I'm not getting what should be in the place of the ... in the last line? Also see my update... – dartdog Apr 23 '17 at 14:46
  • The dots between the parentheses can be replaced by any other options with which you want to initialize the tf.Session(). These would be options you have already specified elsewhere, if any. If you do not have any further specifications, remove the comma and the dots. Before the ":" you define the name by which you will refer to the tf.Session(), for example `with tf.Session(config=sess_config) as sess:` – ml4294 Apr 23 '17 at 14:52
  • Big help! I still have to restructure the models, I think, but I got past the initial error. – dartdog Apr 23 '17 at 16:21
  • Do you load your complete "record count" of a million data samples into GPU memory? If so, this may be a reason for your memory issues. In that case it might be necessary to either reduce your record count (as you have already described) or implement some kind of sequential data read-in from a file, in order to use the full data set for training. – ml4294 Apr 23 '17 at 17:03
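A minimal sketch of the kind of sequential read-in ml4294 describes, assuming the records are already in a Python list (the batch size and the `sess.run` call in the comment are placeholders; in TensorFlow 1.x you would more likely use input queues or the Dataset API for this):

```python
def batches(records, batch_size):
    """Yield fixed-size batches so the whole dataset never sits in GPU memory at once."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Hypothetical usage: feed one batch at a time instead of all 1M records.
data = list(range(1000000))  # stand-in for the 1 million records
n_batches = 0
for batch in batches(data, 40000):
    n_batches += 1  # e.g. sess.run(train_op, feed_dict={x: batch})

print(n_batches)  # 25 batches of 40,000
```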