I'm a new Python programmer and have code that manipulates a large number of files, with operations like compressing, uncompressing, and copying. To improve performance, I use multiprocessing, something like:

from multiprocessing import Pool

pool = Pool(4)
pool.map(do_task, tasks)

There are some execution-time savings: the run time dropped from 75 to 55 seconds. But changing the number of processes doesn't seem to have any impact.

I also tried multithreading, and the result is about the same. It appears the savings are capped at a certain amount no matter what I do.

I have a hard time figuring out why I can't get bigger savings. I've read about terms like CPU-bound and I/O-bound, but I don't know how to tell in practice which one I'm running into. Is that something I can check from Activity Monitor, or what is the suggested approach?
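One practical check, sketched below under the assumption that `do_task` and `tasks` are the names from the snippet above, is to compare CPU time with wall-clock time for a single task: a cpu/wall ratio near 1.0 points at CPU-bound work, while a much lower ratio means the process spends most of its time waiting, typically on I/O.

import time

def check_one_task(task):
    """Run one representative task and compare CPU time to wall time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()

    do_task(task)

    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    # cpu/wall near 1.0 -> CPU-bound; much lower -> mostly waiting on I/O
    print(f"wall: {wall:.2f}s  cpu: {cpu:.2f}s  ratio: {cpu / wall:.2f}")

check_one_task(tasks[0])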

asked by Joe Smith
  • If you're using the CPython implementation, you'll never see great improvements with multiprocessing, due to the GIL. See https://stackoverflow.com/questions/1294382/what-is-a-global-interpreter-lock-gil – John Gordon Oct 23 '17 at 22:10
  • Are you familiar with [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law)? It's possible that your tasks do not have enough parallel code to achieve the performance you want (see the worked sketch after this list). – Prune Oct 23 '17 at 22:13
  • @JohnGordon: Your mention of the GIL makes me think you're confusing multi-threading with multi-processing, which doesn't have an issue with it. – martineau Oct 23 '17 at 22:18
  • @JohnGordon The GIL isn't relevant with multiprocessing. You can definitely get a big speedup with `multiprocessing` (if you have the right kind of workload). – Oct 23 '17 at 22:20
  • Joe: You should know whether your program is CPU or I/O bound based on the nature of the processing it does and how it does it. If you can't tell for some reason, then profile it and see where it's spending most of its time (a minimal example follows this list). See [**How can you profile a script?**](https://stackoverflow.com/questions/582336/how-can-you-profile-a-script) for further information. – martineau Oct 23 '17 at 22:22
  • @Prune, I just read the wiki page for Amdahl's law. In my case, each task is handling a standalone set of files (not in any way related to other tasks). Is there anything else I should check? – Joe Smith Oct 23 '17 at 22:25
  • @JoeSmith: Yes! Complete your parallelization check: do you have the hardware resources to handle those file sets in parallel, or are they contending for I/O resources? Is "handling" a heavy processing load, where they'd compete for processor time? – Prune Oct 23 '17 at 22:41
  • If each task is handling a bunch of files, then things sound I/O bound. It'll be even worse if the files are all on the same device or accessed through the same networking hardware and software—and multi-processing will only be able to help so much. – martineau Oct 23 '17 at 22:53
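To make Prune's Amdahl's-law point concrete, here is a rough back-of-the-envelope sketch using the timings from the question; it assumes the original 75-second run is entirely serial and the 4-process run takes 55 seconds.

n = 4                      # worker processes
speedup = 75 / 55          # observed speedup, about 1.36x

# Amdahl's law: speedup = 1 / ((1 - p) + p / n).
# Solving for p, the fraction of the run that actually runs in parallel:
p = (1 - 1 / speedup) / (1 - 1 / n)
print(f"parallel fraction p = {p:.2f}")   # about 0.36

# With p fixed, the ceiling as n grows is 1 / (1 - p), about 1.55x here,
# which would explain why adding more processes stops helping.
print(f"max speedup = {1 / (1 - p):.2f}")

If only about a third of the run parallelizes (for example, because the rest is serialized behind a single disk), no process count will push the total time much below 75 * (1 - 0.36) ≈ 48 seconds.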
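Similarly, a minimal sketch of the profiling martineau suggests, assuming `do_task` and `tasks` are defined at the top level of the script; heavy cumulative time inside read/write/compression calls points at I/O, while time spent in pure-Python code points at the CPU.

import cProfile
import pstats

# Run a handful of tasks serially under the profiler and dump the stats.
cProfile.run("for t in tasks[:10]: do_task(t)", "profile.out")

# Show the ten entries with the highest cumulative time.
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)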

0 Answers