I need to create a list with a few million elements (between 10^6 and 10^7) that receives data from an external file.
Is it possible to store those elements in a list without freezing up the system, or is there some other method I should use?
EDIT
Based upon the discussion: in order to load huge data from a file onto a list, I would recommend reading the data in chunks as a generator and then combining those pieces using itertools.chain
to get one concatenated generator. The final generator can then be iterated over for further manipulation/processing. This way we use memory efficiently.
Below is a function that reads from the file in chunks and returns a generator of chunks.
def read_data_chunks(file_object, chunk_size=1024):
    """Read a file in chunks using a lazy method (generator).
    chunk_size: default 1 KB."""
    while True:
        data = file_object.read(chunk_size)
        if not data:  # empty read signals end of file
            break
        yield data.strip()
Next, we read the data in chunks from the read_data_chunks function and combine the different pieces together.
from itertools import chain

f = open('numbers1.txt')
gen = iter([])  # start off with an empty iterator

# Adjust the chunk size as needed (10 KB here);
# you can experiment with bigger chunks for a huge file.
for piece in read_data_chunks(f, chunk_size=10240):
    gen = chain(gen, piece)
Now you can access the final generator for further processing (e.g. iterate over it), just like in the previous answer.
for i in gen:
    print i
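As a side note, a more compact equivalent is itertools.chain.from_iterable, which flattens the chunk generator lazily without building a nested chain object per chunk. This is just a sketch, assuming the same read_data_chunks helper and numbers1.txt file as above:

from itertools import chain

with open('numbers1.txt') as f:
    # Lazily flattens each chunk yielded by read_data_chunks;
    # no nested chain objects accumulate as the file grows.
    gen = chain.from_iterable(read_data_chunks(f, chunk_size=10240))
    for i in gen:
        print i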
Previous Answer
If you just want a sequence of 10^6 consecutive integers, you can do as follows. The sequence is created using a generator expression, which doesn't actually generate the items until they are accessed (lazy evaluation).
If we try to create it as an actual list, it runs into a memory error for larger values (depending upon your 32/64-bit OS). For example, on my 64-bit Windows OS it runs into the error at 10**8.
# Memory efficient: we are not actually creating anything at this time.
>>> x = (i for i in xrange(10**6))  # OK with a generator expression
>>> x = (i for i in xrange(10**8))  # OK with a generator expression
>>> y = [i for i in xrange(10**8)]  # MemoryError at 10**8; OK at 10**6, 10**7

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    y = [i for i in xrange(10**8)]
MemoryError
>>>
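To see the difference concretely, here is a small sketch using sys.getsizeof (exact byte counts vary by Python build, and note it measures only the list's pointer array, not the int objects themselves):

import sys

gen = (i for i in xrange(10**6))
lst = [i for i in xrange(10**5)]  # deliberately smaller so it fits easily

print sys.getsizeof(gen)  # a few dozen bytes, regardless of the range
print sys.getsizeof(lst)  # grows linearly with the number of elements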
Even with a generator expression, after 10**10 you start hitting the limit: xrange's argument must fit in a C long, which is 32 bits on Windows builds, hence the OverflowError in the session below. At that point you will need to switch to a different avenue such as pytables, pandas, or databases.
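For example, pandas can stream a huge file in bounded memory instead of loading it all at once. A minimal sketch, assuming the same numbers1.txt file with one number per line (read_csv's chunksize parameter makes it return an iterator of DataFrames):

import pandas as pd

total = 0
# chunksize makes read_csv yield DataFrames of 10**5 rows at a time,
# so memory use stays bounded no matter how large the file is.
for chunk in pd.read_csv('numbers1.txt', header=None, chunksize=10**5):
    total += chunk[0].sum()  # process each chunk, e.g. a running sum
print total

And here is the session where the generator expression finally hits the limit: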
>>> x = (i for i in xrange(10**6))
>>> x = (i for i in xrange(10**8))
>>> x = (i for i in xrange(10**9))
>>> x = (i for i in xrange(10**10))

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    x = (i for i in xrange(10**10))
OverflowError: Python int too large to convert to C long
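If you genuinely need to go past that limit on Python 2, one workaround is a hand-written generator that counts with plain Python ints, which have no such size cap. A minimal sketch (big_range is a hypothetical helper name):

def big_range(n):
    """Yield 0, 1, ..., n-1 lazily; works past the C long limit."""
    i = 0
    while i < n:
        yield i
        i += 1

x = big_range(10**10)  # no OverflowError; still evaluated lazily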
You can iterate over the generator just as you would over a normal list.
>>> for i in x:
        print i
        if i >= 5:
            break

0
1
2
3
4
5
Read more on generator expressions in PEP 289.