Python generator and set(generator) get different results

Question

I have code like below

def yield_multiple():
    for prime in prime_list:
        for multiple in range(prime+prime, end, prime):
            yield multiple

And I use this to get the prime numbers

multiple_set = set(yield_multiple())
result = [v for v in candidate_list if v not in multiple_set]

And I meet the memory error when the set is very large, so I was thinking to use this to save the memory

result = [v for v in candidate_list if v not in yield_multiple()]

But this will get the wrong result. So, How to avoid memory error to get the prime numbers correctly?

Here's my improved solution without too much memory to use.

import math
import sys

import time
from mpi4py import MPI

import eratosthenes

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

TAG_RESULT = 0

n = sys.argv[1]
if n.isdigit():
    start_time = time.time()
    n = int(n)
    sqrt_n = int(math.sqrt(n))

    task_per_block = int(math.ceil((n - 1) / size))
    begin = 2 + rank * task_per_block
    end = begin + task_per_block if begin + task_per_block <= n + 1 else n + 1
    if rank == 0:
        begin = sqrt_n if sqrt_n < end else begin
    sieve_list = [True] * (end - begin)
    prime_list = eratosthenes.sieve(sqrt_n)

    if rank == 0:
        result = sum(prime_list)
        for prime in prime_list:
            start = begin if begin % prime == 0 else (int(begin / prime) + 1) * prime
            for multiple in range(start, end, prime):
                sieve_list[multiple - begin] = False
        result += sum(i + begin for i, v in enumerate(sieve_list) if v)
        result_received = 0
        while result_received < size - 1:
            data = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT)
            result += data
            result_received += 1
        print(result)
        print(time.time() - start_time)
    else:
        for prime in prime_list:
            start = begin if begin % prime == 0 else (int(begin / prime) + 1) * prime
            for multiple in range(start, end, prime):
                sieve_list[multiple - begin] = False
        result = sum(i + begin for i, v in enumerate(sieve_list) if v)
        comm.send(result, dest=0, tag=TAG_RESULT)

Why do you do `for multiple in range(prime, end, prime)` when you are trying to extract prime numbers? You may want to take a look at the Sieve or Eratosthenes algorithm. — cs95, Mar 07 '18 at 10:49
@cᴏʟᴅsᴘᴇᴇᴅ Yes, I knew Sieve of Eratosthenes. But I was trying to do a disturbed version of Sieve of Eratosthenes. So this is a pretty way to get all multiple for a given range — Blaise Wang, Mar 07 '18 at 10:55
for a distributed sieve you may need to merge the streams of multiples in a tree, e.g. like [the one here](https://stackoverflow.com/a/12563800/849891). — Will Ness, Mar 12 '18 at 15:02
@WillNess You are right. It should be `for multiple in range(prime+prime, end, prime)` — Blaise Wang, Mar 14 '18 at 05:34
@cᴏʟᴅsᴘᴇᴇᴅ enumerating the multiples as arithmetic sequences of primes is exactly the essence of the Sieve of Eratosthenes though. — Will Ness, Mar 14 '18 at 14:27
@WillNess I think last time I forget to add [2, 3] at the beginning. Now, it is correct. thanks for your help[https://tio.run/##fZLPboMwDMbveYocSYuq/tF2qMZ1T7AbYigr7oiUBBYCU5@eOUmBUFXLCZzP9s@f095s3ejTOArVNsZSxW1NiKYZPez3e0JIBVd64boSFbdQGq6/IZFCCcvOhOIZuOwB5S/@T@iLAQXaYuToI7@1kHBXvVGfSbf0EJLduQmQVRDMsSDfZku9@Wrp8JnR1ztgJ2BYY7VGKCil6BxJfkzpqfBxr5zjH6aHgm5oMnMxr7o2hgK/1KXu1RcYbPqvB57rGtXOo@Ri0azBdrxtQVdJpGUrqaNQvbSiRQcRITSOuTYxZbq4m8ZxtgZYu5BPDQr0453LLizBgO2NjmDvRvtllVNOgvhnJJt8cMA@xdGGneBl4t7Urvsx1ukZi3ieTxhKbEMpHEVXafh@mCS8nCkfCQ10@DMvHEvkg@8xPNugZm5pA9WNddcPo2lWEGyK@BJ0EiozRsbxDw] — Blaise Wang, Mar 14 '18 at 14:47
@WillNess I've got a better solution now. You can check it above. It will be 20x faster than the single core when I am using 48 cores — Blaise Wang, Mar 14 '18 at 15:12
you're most welcome. --- instead of adding it to the question, I think you should post it as an answer (with some explanations) and accept it, if it answers your original question. it is perfectly OK to post and accept your own answer. — Will Ness, Mar 14 '18 at 16:33

Will Ness · Accepted Answer · 2018-03-14T14:41:28.997

By switching to working by segments between squares of consecutive primes, creating these sets for each segment one after another.

For each segment you'll have to calculate the starting point of the enumeration of a prime's multiples, for each known prime which is not greater than the segment's top value (i.e. the next "core" prime's square).

The "core" primes, to get the squares of, you can get separately, independently, by a recursive application of the same algorithm.

An example of this approach (the separate primes supply that is) is How to implement an efficient infinite generator of prime numbers in Python?

To make it parallel, you'll need to find means to use the set in a shared fashion between all the enumerations, which each will set each of its enumerated multiples off in the same shared set. Order of operations is not important, as long as they are all finished. The access need not be guarded, as setting the same location off twice (or more) is perfectly fine.

This will also be very efficient.

score 0 · Answer 2 · answered Mar 12 '18 at 16:43

0

If you want to stay with this approach - which does have a certain simplicity, although it must be terribly inefficient - the simplest way I can see to do it without constructing a large set or re-running yield_multiple for each candidate is to sort of reverse your membership check:

multiples = {c for c in yield_multiple() if c in candidate_list}
result = [c for c in candidate_list if c not in multiples]

However, unless using your own code is the most important factor here, I'd recommend finding a more efficient approach, like for example the one described in this other answer.

answered Mar 12 '18 at 16:43

Nathan Vērzemnieks

5,495
1
11
23

Sorry to mislead you guys. I don't want to find a faster way to get the prime numbers. This is my parallel programming coursework. Basically, I was trying to parallel Sieve of Eratosthenes. I should start another question. – Blaise Wang Mar 14 '18 at 05:32
@BlaiseWang no need. I edited the tags on your question. – Will Ness Mar 14 '18 at 14:36

Python generator and set(generator) get different results

2 Answers2