
I've just set up a deep learning machine with an NVIDIA GTX 1080 with 11 GB of memory. It was working well until recently, when my Jupyter notebooks started throwing errors: MemoryErrors, ResourceExhaustedErrors, or the machine freezes altogether.
I ran `nvidia-smi -l 3` in the terminal and got the output shown in the image. It looks like the GPU memory is maxed out, mostly by that Anaconda3 Python process. Would deleting Anaconda and reinstalling it with pip, for example, solve the issue?

I've included a link to the image of the terminal output. Please check it out and let me know your suggestions.

Thank you in advance. I've also copied and pasted the output into this question below, but the image is cleaner to read.

nvidia-smi output (same as the image):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
|  5%   49C    P8     9W / 250W |  11122MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1110      G   /usr/lib/xorg/Xorg                            16MiB |
|    0      1147      G   /usr/bin/gnome-shell                          50MiB |
|    0      1397      G   /usr/lib/xorg/Xorg                            90MiB |
|    0      1526      G   /usr/bin/gnome-shell                          40MiB |
|    0      1967      G   /usr/lib/firefox/firefox                       2MiB |
|    0      2283      G   /usr/lib/firefox/firefox                       2MiB |
|    0      2512      C   /home/vivek/anaconda3/bin/python           10515MiB |
|    0      2803      C   /home/vivek/anaconda3/bin/python             383MiB |
+-----------------------------------------------------------------------------+
  • Does this happen while training your ML model defined in the Jupyter notebook? What if you try running a simple model, does it still happen? – Jeppe Sep 08 '18 at 07:41
  • TensorFlow does this by default: it allocates all GPU memory and manages it internally. – Dr. Snoopy Sep 08 '18 at 08:48 (see the config sketch after these comments)
  • @Jeppe Yes, this happens when I train my ML model defined in the Jupyter notebook. I tried running a simple 3-layer MNIST MLP (also in Jupyter), and that works fine. – vreddy Sep 08 '18 at 18:39
  • @MatiasValdenegro Can you please expand on what you mean? Are you saying that it's allocating that much memory but not necessarily using that much? – vreddy Sep 08 '18 at 18:40
  • @Jeppe After running the simple model, I went back and compiled my full model, but I fit it with only 1000 images instead of the full 9068. Once I run the fit command, the terminal repeatedly prints this message: `2018-09-08 11:50:54.448098: E tensorflow/stream_executor/cuda/cuda_driver.cc:903] failed to allocate 2.2K (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY` Then it says `2018-09-08 11:50:55.074973: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 512B. Current allocation summary follows.` – vreddy Sep 08 '18 at 18:58
  • The only times I've encountered memory errors like this (not sure it was the exact same error) were with overly large dimensions in my fully connected MLP layers. What are the sizes of your layers? The first step would be to confirm whether this is a problem with your setup or with your ML model. Otherwise, try rebooting or ending existing processes and running it again, as proposed here: https://stackoverflow.com/a/46021109/3717691 – Jeppe Sep 09 '18 at 19:17
  • If OOM happens during training (call to `fit`) then you may want to check [this](https://stackoverflow.com/a/51183870/1735003) and see if it helps. – P-Gn Sep 10 '18 at 07:22
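Following up on Dr. Snoopy's comment that TensorFlow allocates all GPU memory by default, here is a minimal sketch of how that behaviour can be restricted using the TF 1.x `ConfigProto` GPU options (`allow_growth`, `per_process_gpu_memory_fraction`). The Keras `set_session` call at the end is an assumption about how the notebook's model is built and may not match the actual setup:

```python
import tensorflow as tf
from keras import backend as K

# By default TensorFlow reserves (almost) all GPU memory up front,
# which is why nvidia-smi shows ~11 GB in use even for a small model.
config = tf.ConfigProto()

# Grow GPU memory usage on demand instead of grabbing it all at once.
config.gpu_options.allow_growth = True

# Alternatively, cap TensorFlow at a fixed fraction of the GPU memory:
# config.gpu_options.per_process_gpu_memory_fraction = 0.7

sess = tf.Session(config=config)

# If the model is built with Keras on the TensorFlow backend,
# register this session so Keras uses the restricted configuration.
K.set_session(sess)
```

Either option only affects sessions created afterwards; memory already held by a running notebook kernel stays allocated until that kernel is restarted or shut down.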

0 Answers