
I need to create a list with some millions of elements (between 10^6 and 10^7) that receives data from an external file.

Is it possible to store those elements in a list without freezing up the system, or are there other methods that I should use?

Shiv Rajawat
  • How much RAM do you have? – Klaus D. Jan 23 '18 at 17:29
  • That would mean you need approx ~20 gigabytes of memory... Sounds like way too much for any reasonable application. It also depends on the system whether it can represent such a list. – Willem Van Onsem Jan 23 '18 at 17:30
  • Assuming you actually *need* to do this, see [Very large matrices using Python and NumPy](https://stackoverflow.com/questions/1053928/very-large-matrices-using-python-and-numpy) – Alex K. Jan 23 '18 at 17:31
  • https://stackoverflow.com/questions/855191/how-big-can-a-python-array-get – Brandon Barney Jan 23 '18 at 17:31
  • @AlexK. I need to create a list with the following constraint - 1 – Shiv Rajawat Jan 23 '18 at 17:52
  • @Shiv Rajawat, do you want a list to generate 10^6 numbers in sequence OR hold them in a list from an external source (user input / file etc)? Cause depending upon that, suggestions can vary. – Anil_M Jan 23 '18 at 18:22
  • @ShivRajawat `1 < size < 10^6` is not the same as `len(list) > 10^6`. Your question title implies you want `len(list) > 10^6`, so your clarification only makes your question more confusing. – Joey Harwood Jan 23 '18 at 18:22
  • @Anil_M I want to hold them in a list from an external source (CodeChef's input file). But the information you gave is quite useful too. – Shiv Rajawat Jan 23 '18 at 19:32
  • @ShivRajawat, I've updated the answer with a generator method to read a huge file in chunks and offload it to a combined generator to be processed further. Let me know if it works for your situation. – Anil_M Jan 23 '18 at 21:02
  • @Anil_M It's working completely fine. Great! – Shiv Rajawat Jan 24 '18 at 03:19
  • @ShivRajawat, I am glad it worked out. Do you mind accepting the answer by clicking the check mark next to it? That way the question appears resolved to the community. An upvote would be appreciated as well. – Anil_M Jan 24 '18 at 03:29
  • @Anil_M Upvoting requires some minimum reputation, and I don't have that. Appreciating your effort though. Guys like you are a big help. – Shiv Rajawat Jan 24 '18 at 04:19

1 Answer


EDIT
Based upon the discussion, in order to load huge data from a file into a list, I would recommend reading the data from the file in chunks as generators and then combining those generators with itertools.chain into one concatenated generator. The final generator can then be iterated over for further manipulation/processing. This way memory is used efficiently.

Below is a generator function that reads the file in chunks, yielding one chunk at a time.

def read_data_chunks(file_object, chunk_size=1024):
    """Read a file lazily in chunks (generator).
    chunk_size: default 1 KB."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data.strip()

Next, we read the data in chunks from read_data_chunks and combine the pieces with itertools.chain.

from itertools import chain

f = open('numbers1.txt')
gen = iter([])  # start off with an empty generator

# adjust chunk size as needed; 10 KB here, change as applicable.
# you can experiment with bigger chunks for a huge file.
# note: each piece is a string, so the chained generator yields it character by character.
for piece in read_data_chunks(f, chunk_size=10240):
    gen = chain(gen, piece)

Now you can iterate over the final generator for further processing, just as with the generator in the previous answer below.

for i in gen: 
    print i
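
Since each chunk is a plain string, iterating the chained generator gives you individual characters. If what you ultimately need are the numbers themselves, a line-based generator is another memory-friendly option; this is only a minimal sketch, assuming numbers1.txt contains whitespace-separated integers:

def read_numbers(path):
    """Yield integers one at a time from a whitespace-separated text file."""
    with open(path) as fh:
        for line in fh:  # file objects already iterate lazily, line by line
            for token in line.split():
                yield int(token)

total = 0
for n in read_numbers('numbers1.txt'):  # streams the file instead of loading it all at once
    total += n
print(total)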

Previous Answer

If you just want 10^6 numbers in sequence, you can do as follows with a generator expression, which doesn't actually generate the items until they are accessed (lazy evaluation).

If we instead try to create an actual list, it runs into a MemoryError for larger values (depending upon your 32/64-bit OS). For example, on my Windows 64-bit OS it runs into the error at 10**8.

#memory efficient as we are not actually creating anything at this time.
>>> x = (i for i in xrange(10**6))  #ok with gen comprehension
>>> x = (i for i in xrange(10**8))  #ok with gen comprehension

>>> y = [i for i in xrange(10**8)]  #runs into error @ 10**8, ok at 10**6 , 10**7 

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    y = [i for i in xrange(10**8)]
MemoryError
>>> 
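
To see why the generator expression is so much cheaper, you can compare the two objects with sys.getsizeof. A minimal sketch, assuming Python 2 as in the sessions above; note that getsizeof reports only the list's own pointer array, not the int objects it references:

import sys

gen = (i for i in xrange(10**6))
lst = [i for i in xrange(10**6)]

print(sys.getsizeof(gen))  # a generator object is tiny (tens of bytes); it stores no elements
print(sys.getsizeof(lst))  # the list's pointer array alone is several megabytes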

Even with a generator expression, at 10**10 you hit a limit: xrange's argument must fit in a C long (32-bit on Windows, even under a 64-bit OS), so it raises an OverflowError. At that point you will need to switch to a different avenue such as pytables, pandas, or databases (see the pandas sketch at the end of this answer).

>>> x = (i for i in xrange(10**6))
>>> x = (i for i in xrange(10**8))
>>> x = (i for i in xrange(10**9))
>>> x = (i for i in xrange(10**10))

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    x = (i for i in xrange(10**10))
OverflowError: Python int too large to convert to C long

You can iterate over the generator just as you would over a normal list.

>>> for i in x:
        print i
        if i >= 5:
            break


0
1
2
3
4
5

Read more on generator expressions here.
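
For data that genuinely does not fit in memory, here is a minimal sketch of the pandas route mentioned above, assuming numbers1.txt holds one number per line. read_csv with a chunksize returns an iterator of DataFrames, so only one chunk is in memory at a time:

import pandas as pd

total = 0
# process the file in 100k-row chunks; each chunk is a small DataFrame
for chunk in pd.read_csv('numbers1.txt', header=None, names=['value'], chunksize=100000):
    total += chunk['value'].sum()
print(total)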

Anil_M