2

I have a pretty simple setup like this:

while True:
   model.fit(mySeqGen(xx), ..., use_multiprocessing=True, workers=8)
   <stuff>
   model.save_weights(fname)
   gc.collect()

This runs fine for a long time, but if left overnight it starts raising OSError: [Errno 24] Too many open files on every loop iteration. The full stack trace is on another machine, but it has multiple references to multiprocessing.

This is surely not related to actual files, but to byproducts of threads being created under the hood and not cleaned up properly. Is there a simple way I can make this loop stable over the long run so it cleans up after itself on each pass?

ForceBru
Mastiff
  • Could you post the body of the `mySeqGen` function? It may be happening in there if your generator is streaming data from your file system. – danielcahall Apr 13 '22 at 23:53
  • I can post something soon to give an idea, but it's actually a simulator and does not go to disk in any way. – Mastiff Apr 13 '22 at 23:55
  • I assume you are using keras. Which keras version are you using? Are you sure this is not actually related to too many files being opened and not being properly closed by keras itself? AFAIK `gc.collect()` does not guarantee closing of files. What is `fname` in your code, which filetype are you saving to? – rvd Apr 14 '22 at 14:26
  • Could it be related to sockets? https://stackoverflow.com/questions/39537731/errno-24-too-many-open-files-but-i-am-not-opening-files – rvd Apr 14 '22 at 14:32
  • @rvd Yes, I think it is sockets related to the threads. But how can I force them to close? Or is this a bug in Tensorflow? I'm using Tensorflow's keras within TF version 2.5. No files are ever opened, unless you count the save, but the error stack is all about "../python3.8/multiprocessing/pool.py". Fname is just the name where I have TF save the weights, "myModelWeights" or something. – Mastiff Apr 14 '22 at 15:42
  • @Mastiff I asked after the filename to see if it is an HDF file object or a '.hdf' string to see if there is a bug in the TF saving code here https://github.com/keras-team/keras/blob/d8fcb9d4d4dad45080ecfdd575483653028f8eda/keras/saving/hdf5_format.py#L136. – rvd Apr 14 '22 at 15:51
  • It could be a bug in tensorflow, are you able to test it for tensorflow 2.8? Otherwise, it's your generator `mySeqGen`. Similar issues can be found here https://github.com/tensorflow/tensorflow/issues/43738 and https://github.com/tensorflow/federated/issues/833. – rvd Apr 14 '22 at 16:01
  • Could it be that your `fname` is the same in every loop (i.e. you want to overwrite the weights file), but you do not set `model.save_weights(fname, overwrite=True)`? Are the weights actually saved before you encounter the error? – rvd Apr 14 '22 at 16:18
  • And one last shot: can you use `tf.data` to wrap your generator? That might help TF to handle the sockets correctly. Check out `tf.data.Dataset.from_generator` here https://www.tensorflow.org/guide/data (a sketch follows these comments). – rvd Apr 14 '22 at 16:30
  • Thanks for all the leads. The filename is pretty simple, of the form "myweights_12_3", so I'd be surprised if that was an issue. And at least looking at the OS, the weights files are updated regularly. And the fact that the error trace mentions "multiprocessing" all over the place makes me think it's not a normal file, but a thread/socket thing. – Mastiff Apr 14 '22 at 16:48
  • It must be the generator. What exactly are you doing there? Are you storing something in a list? [Related question](https://stackoverflow.com/questions/33694149/multiprocessing-oserror-errno-24-too-many-open-files-how-to-clean-up-jobs) – rvd Apr 15 '22 at 08:41
  • Why must it be the generator? – Mastiff Apr 15 '22 at 22:49
  • Correction: I have a hunch that it is the generator. Somewhere, file descriptors are not being GC'd. I see three options: A and B are an internal TF bug (possible, but not very likely) in two flavours: A) the weight saving not closing files, B) the multiprocessing pool not being cleaned up due to an internal bug; and C) your generator doing something that prevents the GC from properly collecting the file descriptors that are created, similar to [this situation](https://stackoverflow.com/questions/33694149/multiprocessing-oserror-errno-24-too-many-open-files-how-to-clean-up-jobs). That's all I have. – rvd Apr 17 '22 at 18:26
  • The generator is a simulator. You can imagine it just sending back random ins/outs of the correct size and that would be a close enough model. There is zero disk access and no internal multiprocessing. I was hoping someone just knew about this problem, but it looks like my options will be to either live with it or create a toy example from scratch that has the issue so people have something to play with. – Mastiff Apr 18 '22 at 17:58
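
Following up on the `tf.data.Dataset.from_generator` suggestion in the comments, here is a minimal sketch of what that wrapping could look like. It assumes `mySeqGen(xx)` is a plain Python generator yielding `(inputs, targets)` batches; the shapes and dtypes below are placeholders, not the asker's real ones:

import tensorflow as tf

# Placeholder signature: replace shapes/dtypes with whatever mySeqGen yields.
output_signature = (
    tf.TensorSpec(shape=(None, 128), dtype=tf.float32),  # inputs (assumed)
    tf.TensorSpec(shape=(None, 1), dtype=tf.float32),    # targets (assumed)
)

dataset = tf.data.Dataset.from_generator(
    lambda: mySeqGen(xx),              # re-creates the generator each epoch
    output_signature=output_signature,
).prefetch(tf.data.AUTOTUNE)

while True:
   # tf.data handles its own parallelism, so fit() no longer spins up a
   # multiprocessing pool (and its sockets/pipes) on every iteration.
   model.fit(dataset, ...)            # same elided kwargs as in the question
   model.save_weights(fname)
   gc.collect()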

2 Answers

0

It may be a system limitation.

Enter the following command:

$ ulimit -n

1024

If the result is 1024, each process is limited to 1024 open file descriptors at the same time.

How to fix it:

  1. Reduce the number of threads/workers so you stay under this limit (the simplest fix).

  2. Increase the limit.

a. `ulimit -n 2048` (a temporary change that only applies to the current shell session; the original limit is restored after you exit)

b. Edit the following file:

sudo vim /etc/security/limits.conf

* soft nofile 2048

* hard nofile 2048

Log out and back in (or reboot) after saving so the new limits take effect.

To remove the limits entirely (unlimited):

Data segment size: ulimit -d unlimited

Maximum memory size: ulimit -m unlimited

Stack size: ulimit -s unlimited
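
If changing the system-wide configuration isn't an option, the soft limit can also be raised from inside the Python process with the standard-library resource module (Unix only); a minimal sketch:

import resource

# Current per-process limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Raise the soft limit up to the hard limit. Going beyond the hard limit
# still requires root or the /etc/security/limits.conf change described above.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))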

lazy
0

You can monitor the number of threads, connections and files opened by the process:

import psutil

p = psutil.Process()  # handle to the current process

while True:
   model.fit(mySeqGen(xx), ..., use_multiprocessing=True, workers=8)
   <stuff>
   model.save_weights(fname)
   gc.collect()
   # Log open files, TCP connections and threads after each pass.
   print(f"files={len(p.open_files())} "
         f"conn={len(p.connections(kind='tcp'))} "
         f"threads={p.num_threads()}")

and at least you would know what problem to solve.
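
One caveat: Errno 24 counts every descriptor, including the sockets and pipes a multiprocessing pool creates, and those do not show up in open_files(). On Linux/macOS, psutil's num_fds() reports the total that is actually checked against `ulimit -n`, so it may be worth printing as well:

   print(f"fds={p.num_fds()}")  # total open descriptors for the process (Unix only)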

Diego Torres Milano