Handling a List of 1000000+ Elements

Question

I am trying to find a way to find the distribution of prime gaps for primes less than 100,000,000.

My method:

Step 1: Start with a TXT file "primes.txt" which has a list of primes (up to 10,000,000).

Step 2: Have the program read the file and then insert each number into a list, p1.

Step 3: Find the square root of the upper bound (10 times the upper bound of the primes in the TXT file, in this case, 100,000,000) and create another list, p2, with all the primes less than or equal to that square root, rounded up.

Step 4: Define an isPrime() method that checks whether the input is a prime number (N.B.: because I know that the numbers that will be checked are all less than 100,000,000, I only have to check whether the number is divisible by all the primes less than or equal to the square root of 100,000,000, which is 10,000)

Step 5: Add a list l which collects all the prime gaps, then iterate from 1 to 100,000,000, checking the primality of each number. If the number IS prime, then record the gap between it and the last prime number before it, and also write it to another document "primes2.txt".

Step 6: Output the list l.

The problem:

The program seems to take a LONG time to run. I have a feeling that the problem has to do with how I am managing the list due to its size (the Prime Number Theorem estimates about 620,420 elements in that list from "primes.txt"). Is there a way to reduce the runtime for this program by handling the list differently?

I've attached my code below.

import math
import sys

f = open("primes.txt","r")
p1 = []
for i in f:
    p1.append(int(i))
f.close()
ml = 10000000
ms = math.sqrt(10*ml)
p2 = []
x1 = 0
while x1 < len(p1) and p1[x1] <= int(ms+0.5):
    p2.append(p1[x1])
    x1 += 1

def isPrime(n):
    for i in p2:
        if n%i == 0:
            if n/i == 1:
                return True
            return False
    return True

def main():
    l = [0]*1001 #1,2,4,6,8,10,12,...,2000 (1, and then all evens up to 2000)
    lastprime = -1
    diff = 0
    fileobject = open("primes2.txt",'w')
    for i in xrange(1,10*ml):
        if isPrime(i):
            if i > 2:
                diff = i - lastprime
                if diff == 1:
                    l[0] += 1
                else:
                    l[diff/2] += 1
            lastprime = i
            fileobject.write(str(i)+"\n")
        if i%(ml/100) == 0:
            print i/float(ml/10), "% complete"
    fileobject.close()
    print l

main()

EDIT: Made a change in how the program reads from the file

Hi Yes, it would take a long time to read the file and put into the list, and processing. There are few ways to improve this. Use of concurrency or parallelism would be best in a this situation. split the work around different threads/processes and let them do the task concurrently. Use multiprocessing module and spawn new process providing a chunk of data to it, put the data to output queue from which you can fetch the data.. Or using threading module create a multiple threads. — NishanthSpShetty, Aug 02 '17 at 02:20
1) Use a [better isPrime](https://stackoverflow.com/a/15285588/298607) 2) You waste memory like there is no tomorrow. Read about iterating over a file line-by-line 3) Use a sieve of Eratosthenes to find all primes between two extremes. — dawg, Aug 02 '17 at 02:23
It seems very odd to me that you're using your prime testing code to generate the smaller primes that are already in your main list. Why not just use the list of primes you read from the file? You'd still need to generate the primes from 10 million to 100 million, but skipping the smaller values should make things a bit easier. — Blckknght, Aug 02 '17 at 02:57

dawg · Accepted Answer · 2017-08-02T17:47:03.823

1

Tips:

For the first few X million, you can generate primes as fast as you can read them from a file, but you need to do it efficiently. See below. (Generating n up to 100 million on my recent MacBook Pro takes about 7 seconds in Python. Generating primes up to 100,000,000 takes 4 minutes. It will be faster in PyPy and way faster in C or Swift or with Python with Numpy)
You are careless with memory. Example: ps = f.read().split('\n') and then use that
p1 = [int(i) for i in ps] while ps sits in memory unused. Wasteful. Use a loop that reads the file line-by-line so the memory can be more efficiently used. When you read line-by-line, the file does not sit idly in memory after conversion.
There are very good reasons that big primes are useful for cryptography; they take a long time to generate. Python is not the most efficient language to tackle this with.

Here is a sieve of Eratosthenes to try:

def sieve_of_e(n):
    """ Returns  a list of primes < n """
    sieve = [True] * n
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i]:
            sieve[i*i::2*i]=[False]*((n-i*i-1)/(2*i)+1)
    return [2] + [i for i in xrange(3,n,2) if sieve[i]]

edited Aug 02 '17 at 17:47

answered Aug 02 '17 at 04:49

dawg

98,345
23
131
206

1. While I am aware of the sieve of Eratosthenes, because I am reading from a file ("primes.txt") that already has a lot of prime numbers saved, I felt it unnecessary to use the sieve. Is it still faster to use the sieve? – 2012ssohn Aug 04 '17 at 02:43
2. I've made a change in how the program reads the file, but now it's giving me an error `for i in xrange(1,10*ml): OverflowError: long int too large to convert to int`. (For reference, `ml = 1000000000` so `10*ml = 10000000000`). How do I fix that? – 2012ssohn Aug 04 '17 at 02:46
3. What other languages are there that can handle this problem more efficiently? Sorry for the multiple notifications. – 2012ssohn Aug 04 '17 at 02:47
*Is it still faster to use the sieve?* Yes. – dawg Aug 04 '17 at 14:26
If you have an older version of Python, you may need to use the `long` function. See if `int(str(2**150))` see if `long(str(1**150))` cures it. You should just use a list comprehension and read the file line by line and convert to an int as read. – dawg Aug 04 '17 at 14:34
*What other languages are there that can handle this problem more efficiently?* As stated: numpy if you want to use Python will be somewhat faster and more efficient; C or Swift – dawg Aug 04 '17 at 14:35

score 0 · Answer 2 · answered Aug 02 '17 at 04:45

0

Use the sieve of eratosthenes to upgrade the prime function. Here's a link:

Sieve of Eratosthenes - Finding Primes Python

answered Aug 02 '17 at 04:45

gku1123

11
1

Handling a List of 1000000+ Elements

2 Answers2