Q : 1) Is this the proper way to parallelize this function? Or is there a better/more efficient/faster way?
Q : 2) Is there a better/faster/more efficient way to produce all of these combinations?
A great case for demonstrating a true double-trouble Hell inside the Complexity-ZOO:
All might have already detected that the actual complexity here grows approximately as O( n! / ( k! (n-k)! ) ) in [SPACE] (RAM) and, similarly, approximately as O( n! / ( k! (n-k)! ) ) in [TIME]. Worse, jumps into ~ 100.000 x slower access times are near to come, as soon as the [SPACE] ceases to provide enough RAM for pure in-RAM computing ( one can easily hang a 64-bit O/S just by exhausting its memory management with ill-designed production and storage of combinatorics data on N-tuples from as few as 100 candidates ).
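Just to see how fast this blows up, before any timing is done, a minimal sketch using the standard-library math.comb() ( available since Python 3.8 ):

>>> from math import comb                      # binomial coefficient n! / ( k! (n-k)! )
>>> [ comb( 100, k ) for k in range( 2, 6 ) ]  # how many k-tuples from just 100 candidates
[4950, 161700, 3921225, 75287520]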
While we may, and will, offload as much as possible of the standard python overheads, using hashing tricks and smart iterators that process set()-s instead of list-manipulations ( list is expensive here, as it has to maintain ordering, which combinations do not need ), the RAM consumption will still drive the resulting costs in both the [SPACE] and the [TIME] dimensions.
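Where the downstream logic permits it, one may sidestep the [SPACE] blow-up altogether by consuming the iterator lazily, never materialising the full list. A minimal sketch of that idea ( the is_interesting() predicate is just a hypothetical placeholder for whatever per-tuple test the real work needs ):

import itertools

def stream_filter( candidates, k, is_interesting ):
    # consumes combinations one at a time:
    # ~ O( k ) in [SPACE], instead of ~ O( n! / ( k! (n-k)! ) ) for a stored list
    for aTuple in itertools.combinations( candidates, k ):
        if is_interesting( aTuple ):
            yield aTuple

# usage: keep only the ( hopefully few ) survivors, not all ~ 161.700 triples
kept = list( stream_filter( range( 100 ), 3, lambda t: sum( t ) > 290 ) )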
Selecting pairs from just 10 candidates takes almost no RAM in [SPACE] and costs ~ 30 [us] in [TIME]:
>>> from zmq import Stopwatch # Stopwatch methods {.start()|.stop()} used
>>> aClk = Stopwatch() # for benchmarking ~ [us]-resolution timer
>>> import itertools # for .combinations()
>>>
>>> aSet = set( range( int( 1E1 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
30
30
30
53
30
Selecting triples, again from 10 candidates, takes almost the same:
>>> aSet = set( range( int( 1E1 ) ) )
>>> aClk.start(); _ = [ i for i in itertools.combinations( aSet, 3 ) ]; aClk.stop()
56
54
53
52
51
Now the costs of moving data into [SPACE] start to matter, and a lot:
Selecting pairs from about a hundred candidates is still rather cheap, ~ 1.3 ~ 1.6 [ms] = 1300 ~ 1600 [us], producing the ~ 4.950 legal pairs:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
1963
1591
1322
1326
1547
1601
Selecting triples goes into the range of about ~ 33 ~ 35 [ms], as it produces and stores the ~ 161.700 legal triples:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,3)];aClk.stop()
34741
33505
35580
35932
32989
33434
Selecting quartets will get us to ~ 600 ~ 730 [ms], as it must store all the ~ 3.921.225 legal quartets:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,4)];aClk.stop()
592453
723026
721529
729759
699055
704901
Selecting 5-tuples from just a hundred candidates will need about 8 GB of RAM ( or devastating swaps will start to appear, at an additional cost of about 100.000 x slower processing compared to the in-RAM case ), still taking some ~ 452.000.000 [us] to generate and store the ~ 75.287.520 legal 5-tuples of the cheapest data-type, small int-s, in about 6.1 GB:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,5)];aClk.stop()
451758912
A CPU-monitor confirms this, showing not more than about 30% ~ 40% CPU-loads. The problem soon becomes a memory-bound [SPACE] problem, where the [TIME] costs are related more to memory management, memory allocation and mem-I/O than to the nature of the calculations.
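Before ever calling .combinations() on a larger N, one can therefore pre-compute whether the materialised result has any chance to fit into RAM at all ( a minimal sketch; the ~ 100 B per stored small-int 5-tuple is a rough CPython guess, used only for illustration, not a measured constant ):

>>> from math import comb
>>> comb( 100, 5 )                       # ~ 75.287.520 5-tuples would have to be stored
75287520
>>> comb( 100, 5 ) * 100 / 1E9           # x ~ 100 B per tuple ~ 7.5 GB, in the 6 ~ 8 GB ballpark above
7.528752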
If, in this situation, one were to try to instantiate a few ( the worse, the many ) [CONCURRENT] processes to share the work, the picture does not improve. A thread-based attempt will not help here at all, due to the central python GIL-lock, which is known to re-[SERIAL]-ise all threaded work; it actually prevents any form of concurrent processing, and only corner cases, where latency-masking may help, will generate any positive reward for the effort spent on going multi-threaded in standard python. A process-based attempt adds, on top of that, all the multiprocessing-related instantiation costs: again, the [SPACE]-related and [TIME]-related add-on costs matter, multiplied by how many instances one tries to spawn ( and how ).
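One may benchmark those instantiation add-on costs for oneself, before deciding anything; a minimal sketch, whose absolute numbers will of course differ per platform, python version and process start-method:

import multiprocessing as mp
import time

def _nop():                                   # a do-nothing payload - measures pure overhead only
    pass

if __name__ == '__main__':
    t0 = time.perf_counter()
    p  = mp.Process( target = _nop )
    p.start(); p.join()                       # a full spawn + join round-trip, with zero useful work
    print( ( time.perf_counter() - t0 ) * 1E6, "[us] just to instantiate + join a single process" )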
If one would try to split the work ( which may seem attractive ), so as to reduce the problem of N-tuples into just a coordinated generation of (N-1)-tuples over 4 CPU-cores in a just-[CONCURRENT] pipeline fashion, the expected acceleration ( from generating (N-1)-tuples instead of the full-scale N-tuples ) comes with immense add-on costs: the re-assembly of partial results, which still arrive at a remarkable [SPACE]-size from that just-[CONCURRENT] pipeline, plus the associated memory-management costs, are expensive to pay just for re-joining the partial results, while many still-to-run, memory-bound processing tasks keep waiting to get through the artificially split-join, 4-CPU-cores orchestrated work-flow.
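If one insists on trying such a split anyway, the natural partitioning is to fix the first ( smallest ) member of each N-tuple and let each worker generate the (N-1)-tuples from the remaining, larger candidates. A minimal sketch ( the helper name _suffixes() is mine, just for illustration ), which also makes the re-assembly cost painfully visible, as every worker has to pickle its whole partial list back to the parent process:

import itertools
import multiprocessing as mp

N_CANDIDATES = 100
K            = 4

def _suffixes( first ):                       # all K-tuples that start with the candidate 'first'
    rest = range( first + 1, N_CANDIDATES )
    return [ ( first, ) + t for t in itertools.combinations( rest, K - 1 ) ]

if __name__ == '__main__':
    with mp.Pool( 4 ) as pool:                # 4 [CONCURRENT] workers, each shipping results back
        parts = pool.map( _suffixes, range( N_CANDIDATES ) )
    quartets = [ t for part in parts for t in part ]    # the expensive re-assembly step
    print( len( quartets ) )                            # 3921225, the same as the pure-[SERIAL] run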
A few hard facts, demonstrated above on N-tuples created from as few as 100 candidates, are a great opportunity to test any better approach, if there is one, to beat this double-trouble in the [SPACE]-and-[TIME]-bound, factorially scaled school-book problem.
Best regards from me to your Computing Science professor for making you go into this corner of the Complexity-ZOO Hell ... and for making you get your hands dirty.
RESUME :
For small sizes ( pairs, triples ) both the [SPACE] and the [TIME] costs, shown above for pure-[SERIAL] processing, are so small, if not negligible, that they will never justify an attempt to spawn 4 new [CONCURRENT] processes ( at remarkable add-on costs ) just to harness all 4 CPU-cores, and then to pay the additional, expensive add-on costs of re-assembling the results from the "distributed" processes back into the main python process. The lump sum of all the add-on costs will never be justified.
Smart, vectorised code, like numpy, can "cooperate" with non-colliding, multi-core operations at close-to-zero add-on costs, because it does not instantiate separate processes to transfer jobs for "remote" execution. Instead it uses low-level, compiled functions that work across multiple cores without ever transferring any data to remote processes, letting the other CPU-cores smart-compute independently, right on multiple regions of the original data, exploiting cache-friendly, smart-aligned access to pre-fetched, coherent memory blocks.
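For the pairs case, a vectorised sketch of that idea, using numpy's np.triu_indices(), which builds all index pairs i < j in one low-level pass, with no per-pair python objects at all:

>>> import numpy as np
>>> i, j  = np.triu_indices( 100, k = 1 )     # all ( i, j ) index pairs with i < j
>>> pairs = np.stack( ( i, j ), axis = 1 )    # one contiguous block of ints, no python tuples
>>> pairs.shape
(4950, 2)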
For larger sizes, where the computing might somewhat enjoy a split of the work, the immense [SPACE]-related costs of standard python code will soon devastate any heavily sub-linear speedup, due to the add-on costs, mostly of re-assembling the results back into the main python process, plus the fact that all 4 [CONCURRENT] work-units will devastatingly compete for a shared pool of resources ( the RAM ) during their split-computing "ride", again at an immense adverse cost of memory management ( not to mention swap-originated system suffocation, which may easily hang up all the O/S efforts to fair-trade resources among all the asking processes while obeying all the system-designed priority schedules ).
Last but not least - a BONUS part :
For syntax, check this.
For ideas on performance tuning, this.
For benchmarking the add-on costs of transferring data to/from spawned processes, this ( the benchmarking facts were surprisingly hated there, so do not get distracted by the hysteria of minus-votes; the benchmarking code for cases B, C and D is the important value to re-use ).
For even more details about the proper accounting of all the add-on costs of instantiating more processes, about the ( parameters there + results back ) communication costs, and about the impact of the atomicity-of-work on the final, realistically achievable speedups, do not miss, and feel free to read about, the contemporary criticism of the overhead-naive Amdahl's law.
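As a closing illustration, an overhead-aware re-formulation of the classical speedup can be written down in a few lines ( a minimal sketch, a simplified reading of that criticism: p is the parallelisable fraction, N the number of workers, oSetup / oResults the one-off add-on [TIME] costs, expressed as fractions of the original single-process run ):

def overhead_aware_speedup( p, N, oSetup = 0.0, oResults = 0.0 ):
    # classical Amdahl speedup:  1 / ( ( 1 - p ) + p / N )
    # overhead-aware re-form:    the add-on setup + results-re-assembly costs are paid on top
    return 1.0 / ( ( 1.0 - p ) + p / N + oSetup + oResults )

print( overhead_aware_speedup( 0.95, 4 ) )                                  # naive promise  ~ 3.48 x
print( overhead_aware_speedup( 0.95, 4, oSetup = 0.10, oResults = 0.25 ) )  # realistically  ~ 1.57 x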