Q : 1) Is this the proper way to parallelize this function? Or is there a better/more efficient/faster way?
Q : 2) Is there a better/faster/more efficient way to produce all of these combinations?
A great case for demonstrating a true double-trouble Hell inside the Complexity-ZOO:
All might have already detected that the actual complexity here grows approximately as O( n! / ( k! (n-k)! ) ) in [SPACE] (RAM) and, similarly, approximately as O( n! / ( k! (n-k)! ) ) in [TIME]. Worse, jumps into ~ 100.000 x slower access times are near to come, as soon as the [SPACE] ceases to provide enough RAM for pure in-RAM computing ( one can easily hang a 64-bit O/S just by exhausting its memory management with ill-designed production and storage of combinatorics data on N-tuples from as few as 100 candidates ).
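Just to see how fast this blows up, before any timing is done, a minimal sketch using the standard-library math.comb() ( available since Python 3.8 ):

>>> from math import comb                      # binomial coefficient n! / ( k! (n-k)! )
>>> [ comb( 100, k ) for k in range( 2, 6 ) ]  # how many k-tuples from just 100 candidates
[4950, 161700, 3921225, 75287520]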
While we may, and will, offload as much as possible of the standard python overheads, using hashing tricks and smart iterators that process set()-s instead of list-manipulations ( list is expensive here, as it has to maintain ordering, which combinations do not need ), the RAM consumption will still drive the resulting costs in both the [SPACE] and the [TIME] dimensions.
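Where the downstream logic permits it, one may sidestep the [SPACE] blow-up altogether by consuming the iterator lazily, never materialising the full list. A minimal sketch of that idea ( the is_interesting() predicate is just a hypothetical placeholder for whatever per-tuple test the real work needs ):

import itertools

def stream_filter( candidates, k, is_interesting ):
    # consumes combinations one at a time:
    # ~ O( k ) in [SPACE], instead of ~ O( n! / ( k! (n-k)! ) ) for a stored list
    for aTuple in itertools.combinations( candidates, k ):
        if is_interesting( aTuple ):
            yield aTuple

# usage: keep only the ( hopefully few ) survivors, not all ~ 161.700 triples
kept = list( stream_filter( range( 100 ), 3, lambda t: sum( t ) > 290 ) )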
Selecting pairs from just 10 candidates takes almost no RAM in [SPACE] and costs ~ 30 [us] in [TIME]:
>>> from zmq import Stopwatch # Stopwatch methods {.start()|.stop()} used
>>> aClk = Stopwatch() # for benchmarking ~ [us]-resolution timer
>>> import itertools # for .combinations()
>>>
>>> aSet = set( range( int( 1E1 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
30
30
30
53
30
Selecting triples, again from 10 candidates, takes almost the same:
>>> aSet = set( range( int( 1E1 ) ) )
>>> aClk.start(); _ = [ i for i in itertools.combinations( aSet, 3 ) ]; aClk.stop()
56
54
53
52
51
Now the costs of moving data into [SPACE] start to matter, and a lot:
Selecting pairs from about a hundred candidates is still rather cheap, ~ 1.3 ~ 1.6 [ms] = 1300 ~ 1600 [us], producing the ~ 4.950 legal pairs:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
1963
1591
1322
1326
1547
1601
Selecting triples goes into the range of about ~ 33 ~ 35 [ms], as it produces and stores the ~ 161.700 legal triples:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,3)];aClk.stop()
34741
33505
35580
35932
32989
33434
Selecting quartets will get us to ~ 600 ~ 730 [ms], as it must store all the ~ 3.921.225 legal quartets:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,4)];aClk.stop()
592453
723026
721529
729759
699055
704901
Selecting 5-tuples from just a hundred candidates will need about 8 GB of RAM ( or devastating swaps will start to appear, at an additional cost of about 100.000 x slower processing compared to the in-RAM case ), still taking some ~ 452.000.000 [us] to generate and store the ~ 75.287.520 legal 5-tuples of the cheapest data-type, small int-s, in about 6.1 GB:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,5)];aClk.stop()
451758912
A CPU-monitor confirms this, showing not more than about 30% ~ 40% CPU-loads. The problem soon becomes a memory-bound [SPACE] problem, where the [TIME] costs are related more to memory management, memory allocation and mem-I/O than to the nature of the calculations.
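Before ever calling .combinations() on a larger N, one can therefore pre-compute whether the materialised result has any chance to fit into RAM at all ( a minimal sketch; the ~ 100 B per stored small-int 5-tuple is a rough CPython guess, used only for illustration, not a measured constant ):

>>> from math import comb
>>> comb( 100, 5 )                       # ~ 75.287.520 5-tuples would have to be stored
75287520
>>> comb( 100, 5 ) * 100 / 1E9           # x ~ 100 B per tuple ~ 7.5 GB, in the 6 ~ 8 GB ballpark above
7.528752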
If, in this situation, one were to try to instantiate a few ( the worse, the many ) [CONCURRENT] processes to share the work, the picture does not improve. A thread-based attempt will not help here at all, due to the central python GIL-lock, which is known to re-[SERIAL]-ise all threaded work; it actually prevents any form of concurrent processing, and only corner cases, where latency-masking may help, will generate any positive reward for the effort spent on going multi-threaded in standard python. A process-based attempt adds, on top of that, all the multiprocessing-related instantiation costs: again, the [SPACE]-related and [TIME]-related add-on costs matter, multiplied by how many instances one tries to spawn ( and how ).
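One may benchmark those instantiation add-on costs for oneself, before deciding anything; a minimal sketch, whose absolute numbers will of course differ per platform, python version and process start-method:

import multiprocessing as mp
import time

def _nop():                                   # a do-nothing payload - measures pure overhead only
    pass

if __name__ == '__main__':
    t0 = time.perf_counter()
    p  = mp.Process( target = _nop )
    p.start(); p.join()                       # a full spawn + join round-trip, with zero useful work
    print( ( time.perf_counter() - t0 ) * 1E6, "[us] just to instantiate + join a single process" )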
If one would try to split the work ( which may seem attractive ), so as to reduce the problem of N-tuples into just a coordinated generation of (N-1)-tuples over 4 CPU-cores in a just-[CONCURRENT] pipeline fashion, the expected acceleration ( from generating (N-1)-tuples instead of the full-scale N-tuples ) comes with immense add-on costs: the re-assembly of partial results, which still arrive at a remarkable [SPACE]-size from that just-[CONCURRENT] pipeline, plus the associated memory-management costs, are expensive to pay just for re-joining the partial results, while many still-to-run, memory-bound processing tasks keep waiting to get through the artificially split-join, 4-CPU-cores orchestrated work-flow.
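If one insists on trying such a split anyway, the natural partitioning is to fix the first ( smallest ) member of each N-tuple and let each worker generate the (N-1)-tuples from the remaining, larger candidates. A minimal sketch ( the helper name _suffixes() is mine, just for illustration ), which also makes the re-assembly cost painfully visible, as every worker has to pickle its whole partial list back to the parent process:

import itertools
import multiprocessing as mp

N_CANDIDATES = 100
K            = 4

def _suffixes( first ):                       # all K-tuples that start with the candidate 'first'
    rest = range( first + 1, N_CANDIDATES )
    return [ ( first, ) + t for t in itertools.combinations( rest, K - 1 ) ]

if __name__ == '__main__':
    with mp.Pool( 4 ) as pool:                # 4 [CONCURRENT] workers, each shipping results back
        parts = pool.map( _suffixes, range( N_CANDIDATES ) )
    quartets = [ t for part in parts for t in part ]    # the expensive re-assembly step
    print( len( quartets ) )                            # 3921225, the same as the pure-[SERIAL] run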
A few hard facts, demonstrated above on N-tuples created from as few as 100 candidates, are a great opportunity to test any better approach, if there is one, to beat this double-trouble in the [SPACE]-and-[TIME]-bound, factorially scaled school-book problem.
Best regards from me to your Computing Science professor for making you go into this corner of the Complexity-ZOO Hell ... and for making you get your hands dirty.
RESUME :
For small sizes ( pairs, triples ) both the [SPACE] and the [TIME] costs, shown above for pure-[SERIAL] processing, are so small, if not negligible, that they will never justify an attempt to spawn 4 new [CONCURRENT] processes ( at remarkable add-on costs ) just to harness all 4 CPU-cores, and then to pay the additional, expensive add-on costs of re-assembling the results from the "distributed" processes back into the main python process. The lump sum of all the add-on costs will never be justified.
Smart, vectorised code, like numpy, can "cooperate" with non-colliding, multi-core operations at close-to-zero add-on costs, because it does not instantiate separate processes to transfer jobs for "remote" execution. Instead it uses low-level, compiled functions that work across multiple cores without ever transferring any data to remote processes, letting the other CPU-cores smart-compute independently, right on multiple regions of the original data, exploiting cache-friendly, smart-aligned access to pre-fetched, coherent memory blocks.
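For the pairs case, a vectorised sketch of that idea, using numpy's np.triu_indices(), which builds all index pairs i < j in one low-level pass, with no per-pair python objects at all:

>>> import numpy as np
>>> i, j  = np.triu_indices( 100, k = 1 )     # all ( i, j ) index pairs with i < j
>>> pairs = np.stack( ( i, j ), axis = 1 )    # one contiguous block of ints, no python tuples
>>> pairs.shape
(4950, 2)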
For larger sizes, where the computing might somewhat enjoy a split of the work, the immense [SPACE]-related costs of standard python code will soon devastate any heavily sub-linear speedup, due to the add-on costs, mostly of re-assembling the results back into the main python process, plus the fact that all 4 [CONCURRENT] work-units will devastatingly compete for a shared pool of resources ( the RAM ) during their split-computing "ride", again at an immense adverse cost of memory management ( not to mention swap-originated system suffocation, which may easily hang up all the O/S efforts to fair-trade resources among all the asking processes while obeying all the system-designed priority schedules ).
Last but not least - a BONUS part :
For syntax, check this.
For ideas on performance tuning, this.
For benchmarking the add-on costs of transferring data to/from spawned processes, this ( the benchmarking facts were surprisingly hated there, so do not get distracted by the hysteria of minus-votes; the benchmarking code for cases B, C and D is the important value to re-use ).
For even more details about the proper accounting of all the add-on costs of instantiating more processes, about the ( parameters there + results back ) communication costs, and about the impact of the atomicity-of-work on the final, realistically achievable speedups, do not miss, and feel free to read about, the contemporary criticism of the overhead-naive Amdahl's law.
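As a closing illustration, an overhead-aware re-formulation of the classical speedup can be written down in a few lines ( a minimal sketch, a simplified reading of that criticism: p is the parallelisable fraction, N the number of workers, oSetup / oResults the one-off add-on [TIME] costs, expressed as fractions of the original single-process run ):

def overhead_aware_speedup( p, N, oSetup = 0.0, oResults = 0.0 ):
    # classical Amdahl speedup:  1 / ( ( 1 - p ) + p / N )
    # overhead-aware re-form:    the add-on setup + results-re-assembly costs are paid on top
    return 1.0 / ( ( 1.0 - p ) + p / N + oSetup + oResults )

print( overhead_aware_speedup( 0.95, 4 ) )                                  # naive promise  ~ 3.48 x
print( overhead_aware_speedup( 0.95, 4, oSetup = 0.10, oResults = 0.25 ) )  # realistically  ~ 1.57 x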