
I have an input test file:

10000000
1 23 53 64 599 -645 746 84 944 10 ... (10000000 integers)

The Python 3 code that I use to read the input is as follows:

t = int(input())
a = [int(i) for i in input().split()]

Since the 2nd line is so large, my program uses a huge amount of RAM (< 1 GB for a short interval). The same implementation in C/C++ using cin/scanf:

int a;
cin >> a;
bitset<10000000> visited;
while (a--)
{
    int x;
    scanf("%d",&x);
    visited[x] = true;
}

uses as little as 7 MB. Is there any way to reduce this in Python, i.e. to read the integers without loading the whole string into memory at once (loading the string partially)?

Snowfox
  • Certainly. The standard input stream is `sys.stdin`; you can read characters from it with e.g. `.read(64)` to read 64 characters (or bytes for a binary stream) at a time. – AKX Sep 15 '20 at 16:31
  • [link](https://stackoverflow.com/questions/1896674/python-how-to-read-huge-text-file-into-memory) this will help you! – Aakash Dinkar Sep 15 '20 at 16:33
  • Is there any way to read up to a certain number of integers? This might end up reading an integer halfway. – Snowfox Sep 15 '20 at 16:34
  • @Snowfox You can also always read only one byte at a time with `.read(1)`, but it will be less efficient. – AKX Sep 15 '20 at 16:34

2 Answers


No, you're stuck with writing your own input stream handler. You will need to read a buffer (your choice of buffer size), split what you can, and save any partial integer at the end of the buffer for the next read.

For instance:

import sys

leftover = ''

while True:
    chunk = sys.stdin.read(64)
    if not chunk:  # EOF; anything still in `leftover` is the final number
        break
    buffer = leftover + chunk
    str_num = buffer.split()
    # The read may have cut a number in half; keep the tail for the next pass.
    leftover = str_num.pop(-1) if not buffer[-1].isspace() else ''
    new_values = [int(i) for i in str_num]
    # process new values

When you hit EOF (`read()` returns an empty string), stop; if `leftover` is non-empty at that point, it still holds the final number.
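For example, to mirror the question's `visited[x] = true` bookkeeping, the processing step could fill a preallocated `bytearray`. This is only a sketch: the `bytearray` stands in for the C++ bitset, a larger 65536-byte read size is used purely for speed, and every value is assumed to be a valid non-negative index.

import sys

n = int(sys.stdin.readline())    # first line: the count
visited = bytearray(10000000)    # one byte per flag, all initially 0 ("false")

leftover = ''
while True:
    chunk = sys.stdin.read(65536)
    if not chunk:                # EOF
        break
    buffer = leftover + chunk
    str_num = buffer.split()
    # A number may have been cut in half at the chunk boundary.
    leftover = str_num.pop(-1) if not buffer[-1].isspace() else ''
    for i in str_num:
        visited[int(i)] = 1
if leftover:                     # input didn't end with whitespace
    visited[int(leftover)] = 1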

Prune

To expand upon my comments:

This implementation, according to psutil, takes about 26 megabytes of RSS memory total, with 10 megabytes having been allocated during read_file().

(As an added bonus, since we have the luxury of using a full byte of memory per integer, we can count exactly how many of each value there were (unless we overflow the 8 bits...).)

import array
import psutil


def read_file(inf):
    n = int(inf.readline())
    # Preallocate an array.
    # TODO: this uses one byte per `n`, not one bit.
    # The allocation also takes some additional temporary memory
    # due to the list initializer.
    arr = array.array("b", [0] * n)

    input_buffer = ""
    nr = 0
    while True:
        # Read a chunk of data,
        read_buffer = inf.read(131072)
        # Add it to the input-accumulating buffer
        input_buffer += read_buffer
        if read_buffer:
            # Partition the accumulating buffer from the right,
            # and swap the "rest" (after the last space) to be
            # the new accumulating buffer, in case a number was
            # cut in half at the chunk boundary.
            process_chunk, sep, input_buffer = input_buffer.rpartition(" ")
        else:
            # EOF: process whatever is still buffered.
            process_chunk, input_buffer = input_buffer, ""
        # split() (rather than split(" ")) tolerates the trailing newline
        # and repeated spaces without producing empty tokens.
        for value in process_chunk.split():
            arr[int(value)] += 1
            nr += 1
        if not read_buffer:  # Nothing left to read or process
            break
    assert n == nr
    return arr


mi0 = psutil.Process().memory_info()

with open("output.txt", "r") as inf:
    arr = read_file(inf)

mi1 = psutil.Process().memory_info()
print("initial memory usage", mi0.rss)
print("final memory usage..", mi1.rss)
print("delta from initial..", mi1.rss - mi0.rss)
AKX