Why is my function that uses multiprocessing/multi threading so slow when used in a for loop, but not outside of it?

Question

I'm using multiprocessing and it's very quick on a single large number, but when i try to use it in a for loop, with small numbers, it takes 5 seconds to test just 1009 small numbers. But is magnificently fast for a single number outside a for loop. Is there any way to speed up the multiprocessing when a function that uses multiprocessing is in a for loop? I tried multi-threading and had the same issue, so i'm looking for some expert advice here. I included my code and the timings below for my multiprocessing version, but i can include my multi-threading version as well if needed:

import gmpy2 
import time
 
sinn = 2110229697309202254897383305762150945330987087513434511395506048950594976569434432057019507105035289374307720719984431280856161609820548842778454256113246763860786119268583367543952735347969627478873317341364209555365064365565504232770227619462128918701942169785585423104678142850200975026619010035331023744330713985615650556129731348659986462960062760308034462660525448390420668021248422741300646552941285862310410598374242189448623917196191138254637812716211329113836605859918549332304189053950819346551095911511755911832183789503704294770046935064469435830299623205136625543859303686699678929069468518950480476841246805908501510754550017255944080874819287974625925494008373883250410775902993163965873632474224574883242826458163446781002284368017611606202344050570737818087202137703099075773680753707346415849787963446390136517016131227807076254668461445862154978026041507116570585784569893773262639243954090283224759975513502582494002154146757110676408972377044584495342170277522887809749465855954126593100747444378301829661568735873345178089061677917127496915956539418931430313218084338374827152407795095072639044306222222695685778907958272820576498682506540189586657786292950574081739269257159839589987847266550007783514316481286222515710538845836151864127815058116482680058626451349913138908040817800742009650450811565324184631847563730941344941348929727603343965091116543702880556850922077216848669966268219928808236163268726995495688157209747596437162960244538054993785127947211290438554095851924381172697827312534174244295581184309147813790451951453564726742200569263225639113681905176376701339808868274637448606821696026703034737428319530072483125495383057919894902076566679023694181381398377144302767983385253577700652358431959604517728821603076762965129019244904679015099154368058005173028200266632883632953133017055122970338782493475762548347258351148037427739052271661340801912188203749647918379812483260399614599813650518046331670764766419886619324840045611486524123102046413946014624119568013100078163986683199814025915420877588778260860713148420321896163326473203441644820182490479899368048072263481024886708136521847014624735722333931331098969321911443978386868675912141648200500219168920887757573018380579532261231821382787339600631297820996466930957801607217549420247654458172818940238337170577825003408756362106088558651381993611741503374243481167926898332728164900189941804942580426055589622673679047058619682175301326905577843405270203660160407401675700528981573327582844330828861745574031416926871562443652858767649050943181353635950301154441954046214987718582670685455252774874198771086552440702483933126644594300464549471422237478151976561680719370424626162642534252062987911763456822609569209140676822858933588602318066530038691463577379331113471591913447226829868760176810195567325921301390329055242213842898142597360121925124635965685365925901913816717677946911762631634793638450106377437599347740569467683272089859392249351406815344105961234868327316964137925419770514177021722214309784062017826024217906664090209434553785436385765927274067126192143337589109608949427467825999057058702263715338956534536892852849984934736685814891286495169007648767081688963426768409476169071460997622740467533572971356017575900999100928776382541052696124463195981888715845688808970103527288822088031150716134784735332326775370417950625124642515148342694377095213470544739900830244879573205335578256682901821773047071352497997708791157012233232529777513203024818391621220967964874173106990772425289900446640237659116713251437567138729645677868024033209183367071421651937808005637679844370347367922676824239404492688418047080583797577102267329067247758368597488680401670673861120323439239792549053895366970423259196919428554146265587250617656401028722578111927104663315250291888502226235291264834968061065817079511872899991276288365723969841290984981957389126603952133124328219936785870274843554107325931034103072894378438818494802517594594270034007832922248742746517915210656205746338575621725899098414488628833412591266637224507533934158213117522503993423240638893845121918647788013

 
def ffs(x): 
    """Returns the index, counting from 0, of the 
    least significant set bit in `x`. 
    """ 
    return (x&-x).bit_length()-1 
    
    
 
def MillerRabin(arglist):  
  N = arglist[0] 
  primetest = arglist[1] 
  iterx = arglist[2] 
  powx = arglist[3] 
  withstats = arglist[4] 
  primetest = gmpy2.powmod(primetest, powx, N)  
  if withstats == True: 
     print("first: ", primetest)  
  if primetest == 1 or primetest == N - 1:  
    return True  
  else:  
    for x in range(0, iterx):  
       primetest = gmpy2.powmod(primetest, 2, N)  
       if withstats == True: 
          print("else: ", primetest)  
       if primetest == N - 1: return True  

       if primetest == 1: return False  
  return False  
    
   
def sfactorint_isprime(N, withstats=False): 
 
    N = gmpy2.mpz(N) 
    from multiprocessing import Pool 
 
    if N <= 1: return False 
    if N == 2: 
      return True 
    if N % 2 == 0: 
      return False 
    if N < 2: 
        return False 
         
    # Add Trial Factoring here to speed up smaller factored number testing 
 
     
    iterx = ffs(N-1) 
    """ This k test is an algorithmic test builder instead of using 
        random numbers. The offset of k, from -2 to +2 produces pow tests 
        that fail or pass instead of having to use random numbers and more 
        iterations. All you need are those 5 numbers from k to get a  
        primality answer.  
    """ 
    k = pow(N, -1, 1<<N.bit_length()) - 1 
    t = N >> iterx 
    tests = [k-2, k-1, k, k+1, k+2] 
     
    for primetest in range(len(tests)): 
      if tests[primetest] >= N: 
         tests[primetest] %= N 
   
    arglist = [] 
    for primetest in range(len(tests)): 
      if tests[primetest] >= 2: 
        arglist.append([N, tests[primetest], iterx, t, withstats]) 
      
    with Pool(5) as p: 
       s=p.map(MillerRabin, arglist)     
     
    if s.count(True) == len(arglist): return True 
    else: return False 
     
    return s 
    
start = time.time() 
xx = sfactorint_isprime(sinn) 
end = time.time() 
print (end-start)                                                                                                                                      

0.551032543182373


start = time.time() 
for x in range(1, 1009, 2): 
  xx = sfactorint_isprime(x) 
end = time.time() 
print (end-start)  
                                                                                                                                    
5.828385353088379

Why do you use `gmpy2`? Python already have a built-in large integer support. Note that the Python multi-threading do not improve the code because of the Global Interpreter Lock (GIL). — Jérôme Richard, Apr 14 '21 at 16:11
@JérômeRichard I used gmpy2 as it's pow is about 10-15x faster than python's pow. If you can't use gmpy2, Those statements can just be changed to pow, and the gmpy.mpz statements removed it that makes it easier to test. — oppressionslayer, Apr 14 '21 at 16:14

Jérôme Richard · Answer 1 · 2021-04-14T16:54:29.540

1

This line is very expensive: with Pool(5) as p:.

Indeed, it forks the Python process in several other ones (5 workers here). This system-side call (fork or CreateProcess) is expensive because the OS as to manage the memory of each newly-created process, the scheduling of those processes on the available cores, the (shared) hardware resources, etc. Not to mention the processes are also joined and destroyed at the end of the with scope (this is also expensive). Because this line is executed in a loop of about 500 iterations, it results in the creation of roughly 2500 processes. This should be much more costly than the actual computation. You should try to create processes and then share all the work only once. If this is not possible, you can use inter-process communication to do that. Here, it should be much better to parallelize the outer loop in the second case (for x in range(1, 1009, 2)). Note that Python is not the best tool for parallel computation (this is cumbersome to do and eventually not very efficient).

On mainstream Windows platforms, an additional very-unexpected behaviour happens: the default included Antivirus software analyzes each newly created process to check if they are malicious. This deep analysis is very expensive and totally useless in this specific case. Not to mention that creating a new process is already expensive on Windows.

edited Apr 14 '21 at 16:54

answered Apr 14 '21 at 16:47

Jérôme Richard

41,678
6
29
59

Thanks for the answer! I was wondering if moving to asyncio would help in this scenario, do you know if it has the same limitations, does it open new processes? – oppressionslayer Apr 14 '21 at 17:09
AFAIK, you can create one pool of process and then manage your task with asyncio. This should be much faster than the current code for the second case, but the communication overhead of asyncio is far from being free. Thus, you should still be very careful using it. Moreover, note that multiprocessing-based asyncio does not seem supported on Windows. If you are on an Unix-like platform, you can give a look to https://stackoverflow.com/questions/38193596. – Jérôme Richard Apr 14 '21 at 17:27
Ok, i changed to asyncio, and it was much faster, but still much slower than the non parallel version for small numbers (the larger numbers seem to take so much time to finish anyway that there isn't really an I/O hit). I'm converting my code to c now, but i'm wondering if i will hit on the same issues. – oppressionslayer Apr 14 '21 at 23:07
Yeah, as expected sadly. C provide much more opportunity for optimizations and efficient parallelisation. You can give a look to *OpenMP* to parallelize your code. Note that there will always be an overhead introduced by using multiple threads and processes, but native thread-based communication (in shared-memory) is generally much faster than inter-process Python communications (at least few orders of magnitude). – Jérôme Richard Apr 15 '21 at 00:26

Why is my function that uses multiprocessing/multi threading so slow when used in a for loop, but not outside of it?

1 Answers1