
I have a parallelized task that reads stuff from multiple files and writes the information out to several files.

The idiom I am currently using to parallelize stuff:

import multiprocessing

listOfProcesses = []
for fileToBeRead in listOfFilesToBeRead:
    # args must be a tuple, hence the trailing comma
    process = multiprocessing.Process(target=somethingThatReadsFromAFileAndWritesSomeStuffOut,
                                      args=(fileToBeRead,))
    process.start()
    listOfProcesses.append(process)

for process in listOfProcesses:
    process.join()

It is worth noting that somethingThatReadsFromAFileAndWritesSomeStuffOut might itself parallelize tasks (it may have to read from other files, etc. etc.).

Now, as you can see, the number of processes being created doesn't depend upon the number of cores I have on my computer, or anything else, except for how many tasks need to be completed. If ten tasks need to be run, create ten processes, and so on.

Is this the best way to create tasks? Should I instead think about how many cores my processor has, etc.?

bzm3r
  • It's certainly not the more processes the better. But another thing you should think about is whether creating additional processes makes sense at all. Unless you do heavy (CPU intensive) processing on those files, this kind of thing may very well be I/O-limited. In such case, Python threads will do well enough. – Thijs van Dien May 22 '14 at 20:47
  • Noob answer: start from 1-CPU_CORES and increase till the CPU is maxed (or other dependencies, if you're working with I/O). – Ali Pardhan Apr 18 '21 at 18:01

1 Answer


Always separate the number of processes from the number of tasks. There's no reason why the two should be identical, and by making the number of processes a variable, you can experiment to see what works well for your particular problem. No theoretical answer is as good as old-fashioned get-your-hands-dirty benchmarking with real data.

Here's how you could do it using a multiprocessing Pool:

import multiprocessing as mp

num_workers = mp.cpu_count()  # number of worker processes, independent of the number of tasks

pool = mp.Pool(num_workers)
for task in tasks:
    pool.apply_async(func, args=(task,))

pool.close()   # no more tasks will be submitted to the pool
pool.join()    # wait for all submitted tasks to finish

`pool = mp.Pool(num_workers)` will create a pool of `num_workers` subprocesses. `num_workers = mp.cpu_count()` will set `num_workers` equal to the number of CPU cores. You can experiment by changing this number. (Note that `pool = mp.Pool()` creates a pool of N subprocesses, where N equals `mp.cpu_count()` by default.)
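
If you also need each task's return value, `apply_async` hands back `AsyncResult` objects you can collect once the pool is joined. Here is a minimal sketch under the same assumptions as the example above (`func` and `tasks` stand in for your own function and task list):

import multiprocessing as mp

def func(task):
    # placeholder for your real work: read one file, write some output,
    # and return whatever the parent process needs back
    return task * 2

if __name__ == '__main__':
    tasks = range(10)
    num_workers = mp.cpu_count()   # experiment with this value

    pool = mp.Pool(num_workers)
    async_results = [pool.apply_async(func, args=(task,)) for task in tasks]
    pool.close()
    pool.join()

    results = [r.get() for r in async_results]   # re-raises here if a task raised
    print(results)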

If a problem is CPU-bound, there is no benefit to setting `num_workers` to a number bigger than the number of cores, since the machine can't have more processes operating concurrently than the number of cores. Moreover, switching between the processes may make performance worse if `num_workers` exceeds the number of cores.
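
The "benchmark it" advice can be made concrete with a rough timing sketch: run the same CPU-bound workload at a few pool sizes and see where the gains stop. `busy_work` here is only a hypothetical stand-in for your real processing:

import multiprocessing as mp
import time

def busy_work(n):
    # hypothetical stand-in for CPU-heavy processing of one task
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    tasks = [2000000] * 16

    for num_workers in (1, 2, mp.cpu_count(), 2 * mp.cpu_count()):
        start = time.perf_counter()
        pool = mp.Pool(num_workers)
        pool.map(busy_work, tasks)
        pool.close()
        pool.join()
        print(num_workers, "workers:", round(time.perf_counter() - start, 2), "seconds")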

If a problem is IO-bound -- which yours might be, since your tasks are doing file IO -- it may make sense to have `num_workers` exceed the number of cores, provided your IO device(s) can handle more concurrent tasks than you have cores. However, if your IO is sequential in nature -- if, for example, there is only one hard drive with only one read/write head -- then all but one of your subprocesses may be blocked waiting for the IO device. In that case no concurrency is possible, and using multiprocessing is likely to be slower than the equivalent sequential code.
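
If the work does turn out to be IO-bound, the same pool pattern is available with threads (along the lines of the comment above about Python threads being enough for I/O-limited work): `multiprocessing.dummy` exposes a thread-backed `Pool` with the same interface, so switching is a one-line change. A sketch, with `read_and_write` as a hypothetical stand-in for your per-file function:

from multiprocessing.dummy import Pool as ThreadPool  # same API as mp.Pool, but uses threads

def read_and_write(path):
    # hypothetical stand-in for IO-heavy work: read one file, write results elsewhere
    with open(path) as f:
        return len(f.read())

if __name__ == '__main__':
    paths = ['a.txt', 'b.txt', 'c.txt']   # hypothetical input files
    num_workers = 8                       # for IO-bound work this can exceed cpu_count()

    pool = ThreadPool(num_workers)
    sizes = pool.map(read_and_write, paths)
    pool.close()
    pool.join()
    print(sizes)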

unutbu
  • If you know of a way to predict performance without benchmarking, more power to you. I simply do not know of any reliable method of prediction without testing. – unutbu May 22 '14 at 21:07
  • Ah, I think I understand. I do not want to predict performance as much as learn about stuff like "If a problem is CPU-bound...", "If a problem is IO-bound...", etc. – bzm3r May 22 '14 at 21:12
  • If my problem is IO-bound, is there a point where making lots of processes becomes counter-productive again? If yes, why? Those are the sorts of questions I am interested in understanding :) "Heuristics" – bzm3r May 22 '14 at 21:16
  • Think about what happens when you write to a file. The subprocess executes `f.write`, and waits for the IO to complete. Your OS may switch that core to another subprocess. That subprocess may or may not have something useful to do. Suppose it also wants to perform IO. If you have multiple disks with multiple read/write heads, perhaps it can also initiate some useful work. While it blocks waiting for its IO to complete, the OS can switch that core to yet another subprocess. You will continue to reap a performance gain until your *IO devices* can no longer handle more concurrent tasks. – unutbu May 22 '14 at 21:52
  • Tangential, but worth noting: if the function involved in the top-level multiprocessing may spawn processes of its own, you'll need to subclass `multiprocessing.pool.Pool` to make the first pool's processes non-daemonic. See [this answer](http://stackoverflow.com/a/8963618/2069350). – Henry Keiter May 22 '14 at 22:12
  • I am sort of ok with a heuristic that is likely to work most of the time. I usually have one processor running the main script and one other pre-fetching data in the dataloader with pytorch. So my guess is that `mp.cpu_count() - 2` (or slightly less) might be a good idea. Do you have any comments on this reasoning? I don't care that things are perfect. I care that I get a speedup without making `num_procs` so large that it hinders me. – Charlie Parker Feb 16 '21 at 17:40