
I have run into performance trouble with my scripts while generating and using a large quantity of small files.

I have two directories on my disk (same behavior on HDD and SSD): the first with ~10_000 input files and the second for ~1_300_000 output files. I wrote a script that processes the input files and generates the output using Python's multiprocessing library.

The first 400_000-600_000 output files (I'm not sure exactly where the threshold is) are generated at a constant pace and all 8 CPU cores are used at 100%. Then it gets much worse: performance drops about 20-fold and core usage falls to 1-3% by the time the directory holds 1_000_000 files.
I worked around the issue by creating a second output directory and writing the second half of the output files there (I needed a quick hotfix).

Now, I have two questions:
1) How is creating a new file and writing to it executed in Python on Windows? What is the bottleneck here? (My guess is that Windows looks up whether the file already exists in the directory before writing to it.)
2) What is a more elegant way (other than splitting into directories) to handle this issue correctly?

Jan Jurec
    On NTFS you should disable short filename generation if you have directories with that many files (technically, creating them is the slow part, accessing should be fine). – Joey Feb 08 '17 at 11:32
  • @Joey, why would short filename generation decrease performance of creating new files? Does 'Windows' check if shortened name already exists in directory? – Jan Jurec Feb 08 '17 at 11:39
  • Well, yes, the short file names may not clash. So if you have lots of files with the same prefix, or just generally lots of files (if the prefix shrinks too much, generation switches to a hash-based name variant), then Windows has to find a suitable name that doesn't clash with anything already there. After all, short file names don't change for a file, ever, so you have to ensure they're unique at creation time. – Joey Feb 08 '17 at 13:04
  • Disabling short file name generation will help, but you should still expect a significant loss of performance with so many files in a single directory. If possible, use a hash function to split the output files into (say) a thousand sub-directories. – Harry Johnston Feb 08 '17 at 21:54
  • You should keep in mind that directory contents are kept in B-trees. Lookup in a B-tree is quite fast but still greater than zero. :) Also, adding thousands of nodes means having several index records which may be spread all over your drive. – Andrea Lazzarotto Feb 10 '17 at 18:41

1 Answer


In case anyone has the same problem, the bottleneck turned out to be lookup time for files in crowded directories.

I resolved the issue by splitting the files into separate directories, grouped by one parameter that is evenly distributed over 20 different values. Though now I would do it in a different way.
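
For completeness, roughly what that kind of splitting can look like. This is a generic hash-based sketch (the bucket count, paths, and helper name are illustrative, not my exact code):

```
import hashlib
from pathlib import Path

OUTPUT_ROOT = Path("output")  # illustrative output root
NUM_BUCKETS = 1000            # e.g. a thousand sub-directories, as suggested in the comments


def bucketed_path(filename: str) -> Path:
    """Map a file name to one of NUM_BUCKETS sub-directories via a stable hash."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    subdir = OUTPUT_ROOT / f"{bucket:03d}"
    subdir.mkdir(parents=True, exist_ok=True)
    return subdir / filename


# Write to the bucketed path instead of one huge directory.
with open(bucketed_path("result_0000001.txt"), "w") as f:
    f.write("...")
```

With ~1_300_000 output files and 1_000 buckets, each directory ends up with roughly 1_300 files, so directory lookups stay cheap.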

I recommend solving a similar issue with the shelve module from the Python standard library. A shelf is a single file in the filesystem that you can access like a dictionary and put pickles inside. Just like in real life :)
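
A minimal sketch of what that could look like (the shelf name, keys, and record layout are made up for illustration):

```
import shelve

# Store each result under a string key instead of as a separate file on disk.
with shelve.open("results_shelf") as db:
    db["input_0000001"] = {"status": "ok", "values": [1, 2, 3]}

# Reading back later works like an ordinary dictionary lookup.
with shelve.open("results_shelf") as db:
    record = db["input_0000001"]
    print(record["status"])
```

One caveat: shelve does not support concurrent writers, so with multiprocessing it is easiest to have the workers return their results and let a single process write them into the shelf.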

Jan Jurec