25

I am a beginner at PyTorch and I am just trying out some examples on this webpage. But I can't seem to get the 'super_resolution' program running due to this error:

RuntimeError: DataLoader worker (pid(s) 15332) exited unexpectedly

I searched the Internet and found that some people suggest setting num_workers to 0. But if I do that, the program tells me that I am running out of memory (either with CPU or GPU):

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 9663676416 bytes. Buy new RAM!

or

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 4.00 GiB total capacity; 2.03 GiB already allocated; 0 bytes free; 2.03 GiB reserved in total by PyTorch)

How do I fix this?


I am using Python 3.8 on Windows 10 (64-bit) and PyTorch 1.4.0.


More complete error messages (--cuda means using the GPU, --threads x means passing x to the num_workers parameter):

  1. with command line arguments --upscale_factor 1 --cuda
  File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "E:\Python38\lib\multiprocessing\queues.py", line 108, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Z:\super_resolution\main.py", line 81, in <module>
    train(epoch)
  File "Z:\super_resolution\main.py", line 48, in train
    for iteration, batch in enumerate(training_data_loader, 1):
  File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
    data = self._next_data()
  File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "E:\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 16596, 9376, 12756, 9844) exited unexpectedly
  2. with command line arguments --upscale_factor 1 --cuda --threads 0
  File "Z:\super_resolution\main.py", line 81, in <module>
    train(epoch)
  File "Z:\super_resolution\main.py", line 52, in train
    loss = criterion(model(input), target)
  File "E:\Python38\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "Z:\super_resolution\model.py", line 21, in forward
    x = self.relu(self.conv2(x))
  File "E:\Python38\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "E:\Python38\lib\site-packages\torch\nn\modules\conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "E:\Python38\lib\site-packages\torch\nn\modules\conv.py", line 341, in conv2d_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 4.00 GiB total capacity; 2.03 GiB already allocated; 954.35 MiB free; 2.03 GiB reserved in total by PyTorch)
ihdv

10 Answers

25

This is the solution that worked for me; it may work for other Windows users too. Just remove (or comment out) the num_workers argument to disable parallel loading.
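
A minimal sketch of what that looks like (the dataset here is a placeholder, not the one from the tutorial):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute the tutorial's training set here.
dataset = TensorDataset(torch.randn(32, 1, 64, 64), torch.randn(32, 1, 64, 64))

# Leaving num_workers out (or passing num_workers=0) keeps all data loading
# in the main process, which sidesteps the worker crash on Windows.
training_data_loader = DataLoader(dataset, batch_size=4, shuffle=True)
```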

  • setting `num_workers=0` should generally give you a better traceback that actually tells you what is going wrong if the error persists, or otherwise give you a smooth but slower non-parallel run on a slower machine – dr_dronych Dec 12 '22 at 15:18
  • This is also very platform specific: Windows/macOS and Unix handle multiprocessing differently. Read more on how to properly accommodate that here: https://pytorch.org/docs/stable/data.html#platform-specific-behaviors – dr_dronych Dec 13 '22 at 02:51
13

There is no "complete" solution for GPU out-of-memory errors, but there are quite a few things you can do to relieve the memory demand; a short sketch follows the list below. Also, make sure that you are not passing the trainset and testset to the GPU at the same time!

  1. Decrease batch size to 1
  2. Decrease the dimensionality of the fully-connected layers (they are the most memory-intensive)
  3. (Image data) Apply centre cropping
  4. (Image data) Transform RGB data to greyscale
  5. (Text data) Truncate input at n chars (which probably won't help that much)
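
A hedged sketch of points 1, 3 and 4 for image data (the crop size and the FakeData stand-in are arbitrary examples, not values from the question):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import FakeData  # stand-in for the real image dataset

transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # point 4: RGB -> greyscale
    transforms.CenterCrop(128),                   # point 3: centre cropping
    transforms.ToTensor(),
])

dataset = FakeData(size=64, image_size=(3, 256, 256), transform=transform)
loader = DataLoader(dataset, batch_size=1, shuffle=True)  # point 1: batch size of 1
```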

Alternatively, you can try running on Google Colaboratory (12-hour usage limit on a K80 GPU) or Next Journal, both of which provide up to 12 GB of memory free of charge. Worst case, you might have to conduct training on your CPU. Hope this helps!

ccl
3

On Windows, Aneesh Cherian's solution works well for notebooks (IPython). But if you want to use num_workers>0, you should avoid interpreters like IPython and put the data loading inside if __name__ == '__main__':. Also, with persistent_workers=True the data loading appears to be faster on Windows when num_workers>0.

More information can be found in this thread: https://github.com/pytorch/pytorch/issues/12831
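
A minimal sketch of that layout as a script (the dataset is a placeholder, and persistent_workers requires PyTorch 1.7 or newer, so it is not available on the 1.4.0 used in the question):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader():
    # Placeholder dataset; replace with the real training set.
    dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
    # persistent_workers=True keeps the worker processes alive between epochs,
    # so they are not respawned every epoch (PyTorch 1.7+).
    return DataLoader(dataset, batch_size=8, num_workers=2,
                      shuffle=True, persistent_workers=True)

def train():
    for inputs, targets in make_loader():
        pass  # training step goes here

if __name__ == '__main__':
    # On Windows the worker processes re-import this module, so anything that
    # spawns workers has to sit behind this guard when run as a script.
    train()
```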

gaspar
1

Restart your system so the GPU can regain its memory: save all your work and reboot.

1

I tried fine-tuning it using different combinations. The solution for me was batch_size = 1 and n_of_jobs = 8.

  • This does not really answer the question. If you have a different question, you can ask it by clicking [Ask Question](https://stackoverflow.com/questions/ask). To get notified when this question gets new answers, you can [follow this question](https://meta.stackexchange.com/q/345661). Once you have enough [reputation](https://stackoverflow.com/help/whats-reputation), you can also [add a bounty](https://stackoverflow.com/help/privileges/set-bounties) to draw more attention to this question. - [From Review](/review/late-answers/30326221) – gshpychka Nov 12 '21 at 20:36
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Nov 12 '21 at 20:48
1

I was working with the mmaction trainer when this error showed up. What worked for me was:

cfg.data['workers_per_gpu']=0

where cfg is the training configuration.
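
For context, a hedged sketch of where that line might sit when loading a config with mmcv (the config path is hypothetical, and the exact keys depend on the mmaction version):

```python
from mmcv import Config

# Hypothetical config file; use the one that matches your model.
cfg = Config.fromfile('configs/my_recognizer_config.py')

# Disable DataLoader worker processes for the training data pipeline.
cfg.data['workers_per_gpu'] = 0
```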

Om Rastogi
0

Reduce the number of workers (--threads x in your case).

Dyd666
0

As the accepted response states, there is no explicit solution. However, in my case I had to resize all the images, as the images were large and the model huge. You can refer to this post for resizing: https://stackoverflow.com/a/73798986/16599761
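
For illustration, a hedged sketch of resizing with torchvision transforms (the target size is an arbitrary example, not one taken from the linked post):

```python
from torchvision import transforms

# Downscale every image before it reaches the model to cut activation memory.
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # arbitrary example size
    transforms.ToTensor(),
])
```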

0

April 2023: I believe we have to strike a proper balance between the workers and batch_size parameters.

Cause of error and solution: I was training a network with batch_size=3096 and workers=32. Simply changing it to batch_size=1024 made the error go away; workers=32 stayed the same.
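
A sketch of that change (the dataset and the remaining DataLoader arguments are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real training data.
dataset = TensorDataset(torch.randn(10000, 128), torch.randint(0, 10, (10000,)))

# batch_size lowered from 3096 to 1024; num_workers stays at 32.
loader = DataLoader(dataset, batch_size=1024, num_workers=32, shuffle=True)
```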

Khawar Islam
0

In my case the problem was that the DataLoader failed to access the root directory, e.g. path.join('..', 'data', 'my_dateset').
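
A small sketch of checking the root path up front (the directory name just echoes the answer and is only an example):

```python
import os

root = os.path.join('..', 'data', 'my_dateset')

# Fail early with a clear message instead of an opaque DataLoader worker crash.
if not os.path.isdir(root):
    raise FileNotFoundError(f'Dataset root not found: {os.path.abspath(root)}')
```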

sixtytrees