
I'm dealing with huge numbers. I have to write them into a .txt file. Right now I have to write all the numbers between 1,000,000 and 1,000,000,000 (1M-1B) into a .txt file. Since it throws a MemoryError if I do it in a single list, I sliced the range into batches (I don't like this solution but couldn't find any other).

The problem is, even with the first 50M numbers (1M-50M), I can't even open the .txt file. It's 458MB and took over 15 minutes to write, so I guess the full file will be around 9GB and take over 4 hours if I write all the numbers.

When I try to open the .txt file that contains the numbers between 1M-50M, I get:

myfile.txt has stopped working

So right now the file contains the numbers between 1M-50M and I can't even open it; I guess if I write all the numbers it will be impossible to open.

I have to shuffle the numbers between 1M-1B and store them in a .txt file right now. Basically it's a freelance job, and I'll have to deal with bigger numbers like 100B etc. later. Even the first 50M has this problem, and I don't know how to finish when the numbers get bigger.

Here is the code for 1M-50M:

import random

x = 1000000     # start of the current batch
y = 10000000    # batch size: 10M numbers at a time

while x < 50000001:
    nums = list(range(x, x + y))    # one 10M-number batch
    random.shuffle(nums)
    with open("nums.txt", "a+") as f:
        for z in nums:
            f.write(str(z) + "\n")
    x += y                          # move on to the next batch

How can I speed up this process?

How can I open this .txt file? Should I create a new file every time? If I choose this option I have to slice the numbers even more, since even 50M numbers cause problems.

Is there any module you can suggest that may be useful for this process?

GLHF
  • I suggest you rethink your process on this, whatever you're doing. Opening up a 9 GB text file requires *at least 9 GB* of RAM. Furthermore, it's likely that there's a better approach to whatever problem you're solving... writing integers to a text file is usually not a good approach. – hichris123 May 22 '16 at 20:21
  • Plain text is not an efficient way to store that much numeric data. What are you doing that makes you think you want to do that? – BrenBarn May 22 '16 at 20:21
  • out of curiosity, what sort of job would this be for? – TankorSmash May 22 '16 at 20:21
  • @TankorSmash Client said it's about a FB app, for verification codes. Don't know more, I just have to do this. – GLHF May 22 '16 at 20:24
  • @hichris123 I didn't know that a 9GB file needs 9GB of RAM. So the best way is to create new files each time, but what if I have to store them in a single file? If it's really not a good thing... then I want to choose that option. – GLHF May 22 '16 at 20:26
  • @hichris123: Opening a huge file doesn't consume much RAM, but trying to read the whole thing at once certainly does. – PM 2Ring May 22 '16 at 20:27
  • @BrenBarn It's a freelance job the request is 'for now', store the numbers into a .txt file. – GLHF May 22 '16 at 20:27
  • You need to use generators (`yield`, `next`), not dump the entire list into memory in one piece. – C14L May 22 '16 at 20:37
  • This is slow because you have to create a huge list in order to shuffle the numbers. Now, if you batch it, you are shuffling only inside a batch, so it loses the randomness across batches. Is shuffling a requirement? As it is, it looks wrong. If not required, don't shuffle and iterate (with `for x in range(...):`) instead of creating a list. – JulienD May 22 '16 at 21:05
  • @JulienD Yes, I have to shuffle it. That's why I'm stuck. Normally I don't use .txt files. The requirements are: shuffle them, store them in a .txt file. – GLHF May 22 '16 at 21:09
  • Does the text file have to be readable? Try to store all numbers in a numpy array, then use things like np.tofile, np.save, np.savetxt or np.dump. – JulienD May 22 '16 at 21:16
  • @JulienD Yes that's why I asked how can I open the file. I guess I have to create a file each time a loop done. – GLHF May 22 '16 at 21:17
  • It does sound like the application would be better served by something that can take a seed and use it to generate the same ordered list of numbers each time – Simon Fraser May 22 '16 at 21:20
  • Anyway it is highly probable that `np.savetxt` or `np.ndarray.tofile` from a numpy array will do the brute force job much more efficiently. – JulienD May 22 '16 at 22:05
  • What if you just write them sequentially to the file first, and then shuffle the file? [See here](https://stackoverflow.com/questions/4618298/randomly-mix-lines-of-3-million-line-file). – Will May 22 '16 at 22:33
  • @Will It'll raise MemoryError. That question about 3M lines, I have 999M lines. – GLHF May 22 '16 at 22:35
  • FWIW, there's no need to create a huge list and shuffle it. There are various ways to rapidly generate non-repeating pseudorandom sequences that use very little memory. And I don't see the point in saving such a huge list to disk: it's much more efficient to simply generate the numbers as you need them. – PM 2Ring May 23 '16 at 14:51
  • @PM 2Ring It is not the same. You would get repeats, and if you want to avoid repeats you need to keep a list of already used numbers. – JulienD May 24 '16 at 20:22
  • @JulienD: You do **not** get repeats if you use the right pseudorandom generators. FWIW, I did a test using a [Linear Feedback Shift Register](https://en.wikipedia.org/wiki/Linear-feedback_shift_register) that generates all the numbers in the range(1, 2**26) in a pseudorandom order. On my old 2GHz machine it took a few seconds over 4 minutes to generate those 67108863 numbers and save them to disk as decimals, one per line. – PM 2Ring May 24 '16 at 23:19
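
For illustration, here is a minimal sketch of the kind of generator PM 2Ring describes: a Galois LFSR that visits every value in its range exactly once using O(1) memory. The tap positions below come from commonly published maximal-length tables and are an assumption to verify before real use; consecutive LFSR outputs are also correlated, so this trades statistical quality for memory.

def lfsr_sequence(nbits, taps, seed=1):
    # Galois LFSR: yields every value in 1..2**nbits - 1 exactly once,
    # assuming `taps` describe a maximal-length feedback polynomial.
    mask = sum(1 << (t - 1) for t in taps)
    state = seed
    while True:
        yield state
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= mask
        if state == seed:   # full period completed
            return

def shuffled_numbers(lo=10**6, hi=10**9):
    # Non-repeating pseudorandom walk over [lo, hi] in O(1) memory.
    # Taps (30, 6, 4, 1) are taken from published maximal-length tables
    # for a 30-bit register (an assumption); since 2**30 > 10**9, values
    # outside [lo, hi] are simply skipped.
    for v in lfsr_sequence(30, (30, 6, 4, 1)):
        if lo <= v <= hi:
            yield v

# Stream straight to disk, never holding more than one number in memory:
# with open('nums.txt', 'w') as f:
#     for v in shuffled_numbers():
#         f.write('%d\n' % v)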

2 Answers


Is there any module can you suggest may be useful for this process?

Using Numpy is really helpful for working with large arrays.

How can I speed up this process?

Using Numpy's functions arange and tofile dramatically speeds up the process (see the code below). Generation of the initial array is about 50 times faster, and writing the array to a file is about 7 times faster.

The code just performs each operation once (change number=1 to a higher value to get better accuracy) and only generates the numbers between 1M and 2M, but you can see the general picture.

import random
import timeit
import numpy

x = 10**6
y = 2 * 10**6

def list_rand():
    nums = [a for a in range(x, y)]
    random.shuffle(nums)
    return nums

def numpy_rand():
    nums = numpy.arange(x, y)
    numpy.random.shuffle(nums)
    return nums

def std_write(nums):
    with open('nums_std.txt', 'w') as f:
        for z in nums:
            f.write(str(z) + '\n')

def numpy_write(nums):
    with open('nums_numpy.txt', 'w') as f:
        nums.tofile(f, '\n')

print('list generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='list_rand()', setup='from __main__ import list_rand', number=1)))

print('numpy array generation, random [secs]')
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_rand()', setup='from __main__ import numpy_rand', number=1)))

print('standard write [secs]')
nums = list_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='std_write(nums)', setup='from __main__ import std_write, nums', number=1)))

print('numpy write [secs]')
nums = numpy_rand()
print('{:10.4f}'.format(timeit.timeit(stmt='numpy_write(nums)', setup='from __main__ import numpy_write, nums', number=1)))



list generation, random [secs]
    1.3995
numpy array generation, random [secs]
    0.0319
standard write [secs]
    2.5745
numpy write [secs]
    0.3622

How can I open this .txt file? Should I create a new file every time? If I choose this option I have to slice the numbers even more, since even 50M numbers cause problems.

It really depends on what you are trying to do with the numbers. Find their relative position? Delete one from the list? Restore the array?

Maximilian Peters
  • I couldn't adapt it to work like my code, actually; I haven't used numpy before. Could you slice it, like every 20M, so I don't get a MemoryError? – GLHF May 23 '16 at 00:13
  • You can use `arange` with a start and stop position, so, yes, you create slices in `numpy` just like you did with your list. What computer are you using for this job? – Maximilian Peters May 23 '16 at 06:16
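
For example, a batched sketch along these lines (the file name and the 20M slice size are illustrative, and numpy.savetxt handles the text output; note that, as pointed out in the comments on the question, this still only shuffles within each slice):

import numpy

start, stop = 10**6, 10**9 + 1    # 1M .. 1B inclusive
step = 2 * 10**7                  # 20M numbers per slice (illustrative)

with open('nums.txt', 'w') as f:
    for lo in range(start, stop, step):
        chunk = numpy.arange(lo, min(lo + step, stop))
        numpy.random.shuffle(chunk)        # shuffles within this slice only
        numpy.savetxt(f, chunk, fmt='%d')  # one number per line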

I can't help you with the Python, but if you need to shuffle a consecutive sequence, you can improve the shuffling algorithm. Make a bit array of 1E9 items; it would be about 125MB. Generate a random number. If it is not present in the bit array, add it there and write it to the file. Repeat until you have 99% of the numbers in the file.

Now convert the unused numbers in the bit array into an ordinary array; it would be about 80MB. Shuffle them and write them to the file.

You need about 200MB of memory for 1E9 items (and 8 minutes, written in C#). You should be able to shuffle 100E9 items in 20GB of RAM and in less than a day.
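
A rough Python translation of this idea (the function name and the scaled-down example bounds are illustrative; the 99% cutoff is the one described above):

import random

def bit_shuffle_write(lo, hi, path, cutoff=0.99):
    # Write a random permutation of [lo, hi) using one bit of
    # "already written" state per candidate number (~(hi-lo)/8 bytes).
    n = hi - lo
    seen = bytearray((n + 7) // 8)
    written = 0
    with open(path, 'w') as f:
        # Phase 1: rejection sampling until 99% of the numbers are out.
        while written < n * cutoff:
            i = random.randrange(n)
            if not (seen[i >> 3] >> (i & 7)) & 1:
                seen[i >> 3] |= 1 << (i & 7)
                f.write('%d\n' % (lo + i))
                written += 1
        # Phase 2: the remaining ~1% fit in an ordinary list (note that a
        # CPython list of ints costs more than the 80MB a C# array would);
        # shuffle it and append.
        rest = [lo + i for i in range(n)
                if not (seen[i >> 3] >> (i & 7)) & 1]
        random.shuffle(rest)
        for v in rest:
            f.write('%d\n' % v)

# Scaled-down example run (10M numbers):
# bit_shuffle_write(10**6, 11 * 10**6, 'nums.txt')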

Antonín Lejsek