
I need to read an arbitrarily big file, parse it (which means keep some data in memory while doing it), then write a new version of the file to the file system. Given the memory constraints, I need to read the file incrementally or in batches. However, the bigger the batches, the better (because the information used to parse each line of the file is contained in the other lines of the file).

Apparently, I can get information about memory usage with something like

import psutil
psutil.virtual_memory()

which, among other things, reports the amount of available memory and the percentage of memory used. See this answer https://stackoverflow.com/a/11615673/3924118 for more info.
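
For instance, a quick sketch of the fields I care about (the fields come from the named tuple that psutil.virtual_memory() returns):

import psutil

mem = psutil.virtual_memory()
print(mem.total)      # total physical memory, in bytes
print(mem.available)  # memory that can be given to processes without swapping, in bytes
print(mem.percent)    # percentage of memory currently used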

I would like to determine the size of the batches based on available memory and based on the memory used by and reserved for the current Python process.

Apparently, I can get the memory used by the current Python process with

import os
import psutil
process = psutil.Process(os.getpid())
print(process.memory_info().rss)  # resident set size (physical memory used by this process), in bytes

See https://stackoverflow.com/a/21632554/3924118 for more info.

So, is there a way to have an adaptive batch size (or a generator) based on the memory used by the current Python process and the total memory available on the system, so that I can read as many lines as the available memory allows at a time, then read the next batch, and so on? In other words, I need to read the file incrementally, such that the number of lines read at once is maximized while staying within the memory constraints (for example, up to a threshold of 90% memory usage).
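
To make it concrete, here is a rough sketch of what I have in mind; the 90% threshold, the function name, the file name, and the parsing/writing steps are just placeholders:

import psutil

def read_in_adaptive_batches(path, max_used_percent=90.0):
    """Yield batches of lines, growing each batch until system-wide
    memory usage reaches max_used_percent."""
    with open(path) as f:
        batch = []
        for line in f:
            batch.append(line)
            # Checking after every line is wasteful; in practice this
            # could be done every N lines.
            if psutil.virtual_memory().percent >= max_used_percent:
                yield batch
                # Memory is only reclaimed once the consumer drops the
                # previous batch and garbage collection runs.
                batch = []
        if batch:
            yield batch

for batch in read_in_adaptive_batches("big_file.txt"):  # placeholder file name
    new_lines = parse(batch)   # hypothetical parsing step
    write_lines(new_lines)     # hypothetical writing step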

nbro

1 Answer


I would fix the size of the data you read at a time rather than attempting to fill your memory adaptively. Read your data in fixed-size blocks; it is much easier to deal with.
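
For example, something along these lines (the block size of 100,000 lines is an arbitrary placeholder you would tune to your memory budget):

from itertools import islice

def read_in_fixed_blocks(path, lines_per_block=100_000):
    """Yield fixed-size blocks of lines from the file."""
    with open(path) as f:
        while True:
            block = list(islice(f, lines_per_block))
            if not block:
                break
            yield block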

  • I cannot do this. I really need to incrementally read the file, such that the number of lines read at once is maximized. – nbro Sep 26 '19 at 15:41