I have run into performance trouble with my scripts while generating and using a large quantity of small files.
I have two directories on my disk (same behavior on HDD and SSD): the first with ~10_000 input files and the second for ~1_300_000 output files. I wrote a script that processes the input files and generates the output using the multiprocessing library in Python.
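For context, the script is structured roughly like this (a minimal sketch; the paths, file naming, and per-record processing are hypothetical stand-ins for my actual logic):

```python
import os
from multiprocessing import Pool

INPUT_DIR = r"D:\data\input"    # ~10_000 input files (hypothetical path)
OUTPUT_DIR = r"D:\data\output"  # ~1_300_000 small output files end up here

def process_file(in_path):
    # Read one input file and emit many small output files (simplified).
    with open(in_path, "r", encoding="utf-8") as f:
        records = f.read().splitlines()
    for i, record in enumerate(records):
        out_name = f"{os.path.basename(in_path)}_{i}.txt"
        with open(os.path.join(OUTPUT_DIR, out_name), "w", encoding="utf-8") as out:
            out.write(record)

if __name__ == "__main__":  # required on Windows for multiprocessing
    files = [os.path.join(INPUT_DIR, name) for name in os.listdir(INPUT_DIR)]
    with Pool(processes=8) as pool:  # all 8 cores
        pool.map(process_file, files)
```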
The first 400_000-600_000 output files (I am not sure exactly when I hit the 'threshold') are generated at a constant pace, with all 8 CPU cores at 100% usage. Then it gets much worse: by the time the directory holds 1_000_000 files, throughput drops about 20 times and core usage falls to 1-3%.
I worked around the issue by creating a second output directory and writing the second half of the output files there (I needed a quick hotfix).
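The hotfix amounted to something like this (the threshold value and directory names below are purely illustrative):

```python
import os

OUTPUT_DIR_1 = r"D:\data\output_1"  # hypothetical paths
OUTPUT_DIR_2 = r"D:\data\output_2"
SPLIT_AT = 650_000  # roughly half of the ~1_300_000 output files

def pick_output_dir(file_index):
    # First half of the outputs goes to the original directory,
    # the second half to the new one.
    return OUTPUT_DIR_1 if file_index < SPLIT_AT else OUTPUT_DIR_2
```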
Now, I have two questions:
1) How is creating a new file and writing to it executed in Python on Windows? What is the bottleneck here? (My guess is that Windows checks whether the file already exists in the directory before writing to it.)
2) What is a more elegant way (than splitting into directories) to handle this issue correctly?