I would like to understand what is happening with a MemoryError that seems to occur more or less randomly. I'm running a Python 3 program in Docker on an Azure VM (2 CPUs & 7 GB RAM).

To keep it simple: the program deals with binary files that are read by a specific library (no problem there), then I merge them pair by pair and finally insert the data into a database.
The dataset I get after the merge (and before the db insert) is a Pandas dataframe that contains around 2.8M rows and 36 columns.

For the insertion into the database, I'm using a REST API that obliges me to insert the file in chunks. Before that, I transform the dataframe into a StringIO buffer using this function:

import io
import logging

LOGGER = logging.getLogger(__name__)

# static method from the Utils class
@staticmethod
def df_to_buffer(my_df):
    count_row, count_col = my_df.shape
    buffer = io.StringIO()  # create an empty in-memory text buffer
    my_df.to_csv(buffer, index=False)  # write the whole dataframe into it as CSV text
    LOGGER.info('Current data contains %d rows and %d columns, for a total '
                'buffer size of %d bytes.', count_row, count_col, buffer.tell())
    buffer.seek(0)  # rewind to the start of the stream
    return buffer
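
For scale, the dataframe's own footprint can be measured before the CSV conversion with `memory_usage(deep=True)`. This is only an illustration, reusing `file_df` and `LOGGER` from the snippets in this question:

# per-column byte counts, including the string payloads of object columns
df_bytes = file_df.memory_usage(deep=True).sum()
LOGGER.info('Dataframe occupies %d bytes (%.2f GiB) before CSV conversion.',
            df_bytes, df_bytes / 1024 ** 3)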

So in my "main" program the behaviour is:

# transform the dataframe to a StringIO buffer
file_data = Utils.df_to_buffer(file_df)
buffer_chunk_size = 32000000 #32MB
while True:
    data = file_data.read(buffer_chunk_size)
    if data:
        ...
        # do the insert stuff
        ...
    else:
        # whole file has been loaded
        break
# loop is over, close the buffer before processing a new file
file_data.close()
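
For diagnosis, the read loop above could log the process's resident memory right before the read that fails. A minimal sketch using psutil (assumptions: psutil is installed in the container image, and `LOGGER`, `file_data` and `buffer_chunk_size` are the ones from the snippet above):

import psutil  # third-party package: pip install psutil

process = psutil.Process()  # handle to the current process
while True:
    # log the resident set size just before the allocation that crashes
    LOGGER.info('RSS before read: %d MiB',
                process.memory_info().rss // (1024 * 1024))
    data = file_data.read(buffer_chunk_size)
    if not data:
        break
    # ... do the insert stuff ...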

The problem:
Sometimes I am able to insert 2 or 3 files in a row. Sometimes a MemoryError occurs at a random moment (but always when it's about to insert a new file).
The error occurs at the first iteration of a file insert (never in the middle of a file). It specifically crashes on the line that reads the buffer by chunk: `file_data.read(buffer_chunk_size)`.

I'm monitoring the memory during the process (using htop): it never goes higher than 5.5 GB, and notably, when the crash occurs, the process is using around ~3.5 GB at that moment...

Any information or advice would be appreciated, thanks. :)

EDIT
I was able to debug and more or less identify the problem, but I have not solved it yet.
It occurs when I read the StringIO buffer by chunk. The data chunk increases RAM consumption a lot, as it is a big str that contains 32,000,000 characters of the file. I tried to reduce the chunk size from 32000000 to 16000000. I was able to insert some files, but after some time the MemoryError occurs again... I'm trying to reduce it to 8000000 right now.
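
For reference, a rough sketch of serialising the dataframe slice by slice, so that only one slice's CSV text exists in memory at a time instead of the whole ~885 MB buffer. Assumptions: the storage API accepts chunks split at row boundaries, and `rows_per_chunk` is an arbitrary value to tune:

import numpy as np

rows_per_chunk = 200000  # hypothetical slice size; tune it to the API's chunk limit
n_slices = int(np.ceil(len(file_df) / rows_per_chunk))
for i in range(n_slices):
    chunk_df = file_df.iloc[i * rows_per_chunk:(i + 1) * rows_per_chunk]
    # to_csv with no path returns the CSV text for this slice only;
    # write the header just once, with the first slice
    data = chunk_df.to_csv(index=False, header=(i == 0))
    # ... do the insert stuff with `data` ...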

  • Do you get rid of the dataframe somehow? Or is it left for the garbage collector? – Piotr Kamoda Jun 27 '19 at 08:17
  • Hmm, I'm not doing any `del` or anything like that, as I've read that we cannot really force the garbage collector to free it (or maybe I misunderstood :p). See: https://stackoverflow.com/a/39377643/2409641. By the way, the dataframe gets "replaced" by new file data on every iteration. – Flo Jun 27 '19 at 08:18
  • Yeah, as it said, it's not 'free' but it's free for overriding. So even if memory usage doesn't drop, perhaps new dataframes will have less trouble overriding previous ones because the garbage collector had to take a cigarette break. – Piotr Kamoda Jun 27 '19 at 08:21
  • @PiotrKamoda one cannot free memory explicitly in Python, so the point is moot. You are always at the mercy of garbage collection. CPython uses reference counting, though, which will deterministically reclaim objects *immediately* as soon as their reference count goes to zero. – juanpa.arrivillaga Jun 27 '19 at 08:38
  • I just tried to use `del` the dataframe and its reference, no difference, still MemoryError – Flo Jun 27 '19 at 08:47
  • Perhaps the reference counter is not reaching zero for some reason? It's worth a try, I think; this way we can at least cross something out with 100% certainty. – Piotr Kamoda Jun 27 '19 at 08:48
  • What is the output of the LOGGER for those files that crashed, does it look ok? – Piotr Kamoda Jun 27 '19 at 08:53
  • An example: `Current data contains 2888406 rows and 36 columns, for a total buffer size of 885040853 bytes.` Thing is that sometimes files with more rows than this one are being inserted well... – Flo Jun 27 '19 at 08:55
  • How about you catch it and test for memory usage from in-process? I've googled something called [psutil](https://pypi.org/project/psutil/) – Piotr Kamoda Jun 27 '19 at 09:00
  • Has it happened anytime for the __first__ file in the queue? Perhaps try `del file_data`. – Piotr Kamoda Jun 27 '19 at 09:04
  • Yes it just did happen for the first file in the queue (it never happened before). – Flo Jun 27 '19 at 09:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195618/discussion-between-piotr-kamoda-and-flo). – Piotr Kamoda Jun 27 '19 at 09:07
  • edited my question with some kind of problem identification – Flo Jun 27 '19 at 13:39
  • Just to get a sense of what you're working with, use `df.memory_usage(deep=True)` to see what the size of the dataframe you're creating is. – Liam Shalon Jun 27 '19 at 22:46
  • If you're only using csv to insert into database, why not skip the to_csv and just use df.to_sql(...chunksize=...) to insert directly into database? – altunyurt Jul 29 '19 at 22:23
  • @altunyurt because my database is not a SQL db. It is a storage like Amazon S3. – Flo Jul 30 '19 at 08:00
  • @Flo https://github.com/pandas-dev/pandas/blob/f34dbbf28b4c4d1b241f7bda155284fe0c131d18/pandas/core/generic.py#L3047 here is the code for df.to_csv. You could also have a look at the other formatters in the same file. You could try to come up with an S3 solution similar to the existing ones by modifying one. Otherwise, what I see is that you are storing the same data in different formats in memory during the whole operation. The easiest would be to pass the data directly from the df to S3 with a formatter/writer middleware similar to to_csv etc. – altunyurt Jul 30 '19 at 12:47

0 Answers