I would like to understand a MemoryError that seems to occur more or less randomly. I'm running a Python 3 program inside Docker on an Azure VM (2 CPUs, 7 GB RAM).
To keep it simple: the program reads binary files with a specific library (no problem there), then merges them in pairs, and finally inserts the data into a database.
The dataset I get after the merge (and before the DB insert) is a Pandas dataframe containing around ~2.8M rows and 36 columns.
For the insertion into the database, I'm using a REST API that requires me to send the file in chunks. Before that, I transform the dataframe into a StringIO buffer using this function:
# static method from Utils class
@staticmethod
def df_to_buffer(my_df):
    count_row, count_col = my_df.shape
    buffer = io.StringIO()  # creating an empty buffer
    my_df.to_csv(buffer, index=False)  # filling that buffer
    LOGGER.info('Current data contains %d rows and %d columns, for a total '
                'buffer size of %d bytes.', count_row, count_col, buffer.tell())
    buffer.seek(0)  # set to the start of the stream
    return buffer
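For reference, here is a standalone, runnable sketch of that helper with a tiny toy dataframe (the plain function, the basic logging setup and the toy data are simplifications for the example, not my real code):

import io
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

def df_to_buffer(my_df):
    # Same logic as Utils.df_to_buffer: the whole CSV text ends up inside the StringIO.
    count_row, count_col = my_df.shape
    buffer = io.StringIO()
    my_df.to_csv(buffer, index=False)
    LOGGER.info('Current data contains %d rows and %d columns, for a total '
                'buffer size of %d bytes.', count_row, count_col, buffer.tell())
    buffer.seek(0)
    return buffer

if __name__ == '__main__':
    # Toy dataframe standing in for the real ~2.8M-row merge result.
    demo_df = pd.DataFrame({'a': range(5), 'b': ['x'] * 5})
    buf = df_to_buffer(demo_df)
    print(repr(buf.read()))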
So in my "main" program the behaviour is :
# transform the dataframe to a StringIO buffer
file_data = Utils.df_to_buffer(file_df)

buffer_chunk_size = 32000000  # 32MB

while True:
    data = file_data.read(buffer_chunk_size)
    if data:
        ...
        # do the insert stuff
        ...
    else:
        # whole file has been loaded
        break

# loop is over, close the buffer before processing a new file
file_data.close()
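A runnable sketch of the same chunked-read pattern, with the REST call replaced by a placeholder (insert_chunk and the toy buffer are stand-ins, just to make the loop self-contained):

import io

def insert_chunk(chunk_text):
    # Placeholder for the real REST API call; here it only reports the chunk size.
    print('would insert %d characters' % len(chunk_text))

file_data = io.StringIO('col_a,col_b\n' + '1,2\n' * 10)  # stand-in for df_to_buffer(...)
buffer_chunk_size = 16  # tiny chunk size so the toy example loops a few times

while True:
    data = file_data.read(buffer_chunk_size)
    if data:
        # each read materializes a new str of up to buffer_chunk_size characters
        insert_chunk(data)
    else:
        # whole buffer has been consumed
        break

file_data.close()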
The problem:
Sometimes I am able to insert 2 or 3 files in a row. Sometimes a MemoryError occurs at a random moment (but always when it's about to start inserting a new file).
The error occurs at the first iteration of a file insert (never in the middle of a file). It crashes specifically on the line that reads the buffer by chunk: file_data.read(buffer_chunk_size)
I'm monitoring the memory during the process (using htop): it never goes higher than 5.5 GB, and at the moment the crash occurs the process is using around ~3.5 GB of memory...
Any information or advice would be appreciated, thanks. :)
EDIT
I was able to debug and more or less identify the problem, but I have not solved it yet.
It occurs when I read the StringIO buffer by chunk. Each data chunk increases RAM consumption a lot, as it is a big str holding up to 32000000 characters of the file.
I tried reducing it from 32000000 to 16000000. I was able to insert some files, but after some time the MemoryError occurred again... I'm now trying to reduce it to 8000000.
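A small sketch like the one below can be used to quantify the different copies that end up in memory: the dataframe itself, the full CSV text held by the StringIO, and the temporary chunk str (the toy dataframe and the 1M chunk size here are just stand-ins for my real values):

import io
import sys

import pandas as pd

# Toy dataframe standing in for the real ~2.8M x 36 merge result.
df = pd.DataFrame({'col%d' % i: range(100000) for i in range(36)})

buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_size = buffer.tell()  # number of characters written
buffer.seek(0)

chunk = buffer.read(1000000)  # 1M-character chunk for the toy example

print('dataframe memory usage : %d bytes' % df.memory_usage(deep=True).sum())
print('CSV text in the buffer : %d characters' % csv_size)
print('one chunk (str) size   : %d bytes (sys.getsizeof)' % sys.getsizeof(chunk))

With my real data the buffer already holds the whole CSV text, so the dataframe, the full buffer and the current chunk all sit in RAM at the same time.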