Thanks for the interesting question!

I wrote two versions of the classical Sieve of Eratosthenes from scratch: one single-core and one multi-core (using all CPU cores).

The main speedup in both versions comes from using NumPy, the official Python package; install it once with `python -m pip install numpy`. The multi-core version gets an additional speedup from using all CPU cores, which gives almost an N-times speedup on N cores compared to the single-core version.

Right now I have only a two-core laptop. My program produced the following timings:
```
Computing primes less than 8M
Number of CPU cores 2
Original time 72.57 sec
Simple time 5.694 sec
Simple boost (compared to original) 12.745x
Multi core time 2.642 sec
Multi core boost (compared to simple) 2.155x
```
`Original` above means the code from your question's body (the second version, which you had already optimized). `Simple` is my single-core version. `Multi` is my multi-core version.
As you can see above, computing primes below 8 million took 72 seconds with your original code. My single-core version took 5.7 seconds, which is 12.7x faster than your code, and my 2-core version took 2.6 seconds, which is 2.1x faster than my single-core version and 27x faster than your original code.

In other words, I got a 27x speedup with my multi-core code compared to yours, which is really a lot! And that is only on a 2-core laptop; if you have 8 or more cores you will get a much bigger speedup. But keep in mind that the real multi-core speedup only shows up for a fairly large prime limit; try a limit of 64 million or more by modifying the line `primes_limit_M = 8` in my code, which sets the limit in millions.
Just to dwell on the details: my single-core version is almost like your code, but it uses NumPy, which makes any array operation very fast, instead of pure-Python loops over lists.
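As a minimal, self-contained sketch of that idea (with a small illustrative limit of 50): marking all multiples of a prime is a single NumPy slice assignment instead of a Python-level inner loop.

```python
import numpy as np

end = 50
composites = np.zeros(end, dtype=np.uint8)
composites[:2] = 1  # 0 and 1 are not prime
for p in range(2, int(end ** 0.5) + 1):
    if not composites[p]:
        # One vectorized step marks every multiple of p starting at p*p
        composites[p * p :: p] = 1

primes = np.flatnonzero(composites == 0)
print(primes.tolist())  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
```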
The multi-core version also uses NumPy, but it splits the sieved range into as many equal-sized pieces as there are CPU cores on your machine, and each core sets the boolean flags in its own part of the array. This technique gives a speedup only until you hit the bandwidth limit of your memory (RAM), so beyond some point adding more CPU cores yields no extra speedup.
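The per-block offset arithmetic used in `SieveOfEratosthenes_Task` below can be checked on a small hand-picked example (the prime list and block start here are illustrative values, not part of the real run):

```python
import numpy as np

# For a block starting at `begin`, find the offset of the first multiple
# of each small prime inside the block; that is where sieving starts.
small_primes = np.array([2, 3, 5, 7], dtype=np.uint32)
begin = 30
offsets = small_primes - begin % small_primes
offsets[offsets == small_primes] = 0  # begin itself is already a multiple
print(offsets.tolist())  # [0, 0, 0, 5] -> 30, 30, 30 and 35 = 7 * 5
```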
By default the multi-core version uses all CPU cores, but you may experiment with fewer cores than your machine has; this can give an even better result, because using the maximum number of cores is not guaranteed to be fastest. Tweak the number of cores by changing the line `cpu_cores = mp.cpu_count()` to something like `cpu_cores = 3`.
Try it online!
```python
def SieveOfEratosthenes_Original(end):
    import numpy as np
    limit = end - 1
    limitn = limit + 1
    primes = dict()
    for i in range(2, limitn):
        primes[i] = True
    for i in primes:
        factors = range(i, limitn, i)
        for f in factors[1:]:
            primes[f] = False
    return np.array([i for i in primes if primes[i] == True], dtype = np.uint32)

def SieveOfEratosthenes_Simple(end):
    # https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
    import numpy as np
    composites = np.zeros((end,), dtype = np.uint8)
    composites[:2] = 1  # 0 and 1 are not prime
    for p, c in enumerate(composites):
        if c:
            continue
        # Mark all multiples of p in one vectorized slice assignment
        composites[p * p :: p] = 1
    return np.array([p for p, c in enumerate(composites) if not c],
        dtype = np.uint32)

def SieveOfEratosthenes_Task(small_primes, begin, end):
    # Sieve the sub-range [begin, end) using the precomputed small primes
    import numpy as np
    composites = np.zeros((end - begin,), dtype = np.uint8)
    # Offset of the first multiple of each small prime inside this block
    offsets = np.full((len(small_primes),), begin, dtype = np.uint32)
    offsets = small_primes - offsets % small_primes
    offsets[offsets == small_primes] = 0
    for off, p in zip(offsets, small_primes):
        composites[off :: p] = 1
    return np.array([begin + i for i, c in enumerate(composites) if not c],
        dtype = np.uint32)

def SieveOfEratosthenes_MultiCore(end, *, nthreads = None):
    import math, multiprocessing as mp, numpy as np
    # Sieve primes up to sqrt(end) on a single core first
    end_small = math.ceil(math.sqrt(end)) + 1
    small_primes = SieveOfEratosthenes_Simple(end_small)
    if nthreads is None:
        nthreads = mp.cpu_count()
    # Split the remaining range [end_small, end) into equal blocks, one per core
    block = (end - end_small + nthreads - 1) // nthreads
    with mp.Pool(nthreads) as pool:
        return np.concatenate([small_primes] + pool.starmap(SieveOfEratosthenes_Task, [
            (small_primes, min(end_small + ithr * block, end), min(end_small + (ithr + 1) * block, end))
            for ithr in range(nthreads)]))

def Test():
    import time, numpy as np, multiprocessing as mp
    primes_limit_M = 8
    cpu_cores = mp.cpu_count()
    end = primes_limit_M * 2 ** 20
    print(f'Computing primes less than {primes_limit_M}M')
    print('Number of CPU cores', cpu_cores, flush = True)
    tim_orig = time.time()
    res_orig = SieveOfEratosthenes_Original(end)
    tim_orig = time.time() - tim_orig
    print('Original time', round(tim_orig, 3), 'sec', flush = True)
    tim_simple = time.time()
    res_simple = SieveOfEratosthenes_Simple(end)
    tim_simple = time.time() - tim_simple
    print('Simple time', round(tim_simple, 3), 'sec', flush = True)
    assert np.all(res_orig == res_simple)
    print(f'Simple boost (compared to original) {tim_orig / tim_simple:.3f}x')
    tim_multi = time.time()
    res_multi = SieveOfEratosthenes_MultiCore(end, nthreads = cpu_cores)
    tim_multi = time.time() - tim_multi
    print('Multi core time', round(tim_multi, 3), 'sec', flush = True)
    assert np.all(res_simple == res_multi)
    print(f'Multi core boost (compared to simple) {tim_simple / tim_multi:.3f}x')

if __name__ == '__main__':
    Test()
```