
I have a script that is constantly measuring some data and regularly storing it in a file. In the past I was storing the data in a "manually created CSV" file in this way (pseudocode):

with open('data.csv','w') as ofile:
    print('var1,var2,var3,...,varN', file=ofile) # Create CSV header.
    while measure:
        do_something()
        print(f'{var1},{var2},{var3},...,{varN}', file=ofile) # Store data.

I worked this way for several months and ran this script several hundred times with no issues other than 1) it is cumbersome (and error prone) when N is large (in my case between 20 and 30) and 2) CSV does not preserve data types. So I decided to change to something like this:

temporary_df = pandas.DataFrame()
while measure:
    do_something()
    temporary_df = temporary_df.append({'var1':var1,'var2':var2,...,'varN':varN}, ignore_index=True) # DataFrame.append returns a new dataframe, it does not modify in place.
    if save_data_in_this_iteration():
        temporary_df.to_feather(f'file_{datetime.datetime.now()}.fd')
        temporary_df = pandas.DataFrame() # Clean the dataframe.
merge_all_feathers_into_single_feather()
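For reference, the merge at the end is conceptually just reading the chunks back and concatenating them, something along these lines (simplified sketch; the glob pattern and the output file name are placeholders, not my actual code):

import glob
import pandas

def merge_all_feathers_into_single_feather():
    # Read every chunk written during the measurement and glue them together.
    chunks = [pandas.read_feather(fname) for fname in sorted(glob.glob('file_*.fd'))]
    pandas.concat(chunks, ignore_index=True).to_feather('merged.fd')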

At first glance this was working perfectly, as I expected. However, after some hours Python crashes. After experiencing this both on a Windows machine and on a (separate) Linux machine, I noticed that the problem is that Python slowly eats up the machine's memory until there is none left, and then of course it crashes.

Since the function do_something is unchanged between the two approaches, the crash happens before merge_all_feathers_into_single_feather is ever called, and save_data_in_this_iteration is trivially simple, I am blaming Pandas for this problem.

Google has told me that other people have had memory problems while using Pandas in the past. I have tried calling the garbage collector in each iteration, as suggested e.g. here, but it did not work for me. I haven't tried the multiprocessing approach yet because it looks like killing an ant with a nuke, and it may bring other complications...
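To be explicit, by "the garbage collector line" I mean something along these lines (same pseudocode as above, with a collection forced on every iteration):

import gc

while measure:
    do_something()
    temporary_df = temporary_df.append({'var1':var1,'var2':var2,...,'varN':varN}, ignore_index=True)
    if save_data_in_this_iteration():
        temporary_df.to_feather(f'file_{datetime.datetime.now()}.fd')
        temporary_df = pandas.DataFrame() # Clean the dataframe.
    gc.collect() # Force a garbage collection pass on every iteration.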

Is there any solution to keep using Pandas like this? Is there a better solution to this without using Pandas? Which?

user171780
    Tbh the multiprocessing solution is what I would go for. It's not really "killing an ant with a nuke", it's the only guaranteed way you have to ensure all memory used by pandas when writing the Feather file is actually released. I would maybe not use `Pool(1)`, but rather explicitly use a `Process` [as done here](https://stackoverflow.com/a/2046630/3214872), but the underlying idea is the same. – GPhilo Oct 29 '21 at 08:49
  • Thanks for your comment. I will try that. I have just made a simple MWE with Pandas and there were no memory issues, so there might be something else hidden. I will post updates in a couple of days after testing the multiprocessing approach. – user171780 Oct 29 '21 at 09:11

1 Answer


Pandas was not the problem

After struggling with this problem for a while, I decided to create an MWE (minimal working example) to do some tests. So I wrote this:

import pandas
import numpy
import datetime

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True) # Grow the dataframe by one row of random data.
    if 'last_clean' not in locals() or (datetime.datetime.now()-last_clean).seconds > .5: # Dump and reset roughly once per second (.seconds is an integer).
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        df = df[0:0] # Empty the dataframe but keep the columns.

To my surprise, the memory is not drained by this script! So I concluded that Pandas was not my problem.
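For anyone who wants to reproduce the test: one simple way to watch the memory from inside the loop (assuming the psutil package is available; any system monitor works just as well) is to print the resident set size of the process every now and then:

import os
import psutil

process = psutil.Process(os.getpid())
# Put a line like this inside the loop to see whether the resident memory keeps growing.
print(f'RSS = {process.memory_info().rss/1024**2:.1f} MiB')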

Then I added a new component to the MWE and I found the issue:

import pandas
import numpy
import datetime
import matplotlib.pyplot as plt

def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png')
    # Uncomment the following two lines to release the memory and stop the "leak".
    # fig.clear()
    # plt.close(fig)

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now()-last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        save_matplotlib_plot(df) # Here was my "leak" (not really a leak: matplotlib keeps a reference to every figure it creates until it is closed, so it was working as designed).
        df = df[0:0]

It seems that when I switched from "handmade CSV" to Pandas I also changed something in the plotting code, so I was blaming Pandas when it was not the problem.
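In other words, every figure created with plt.subplots stays registered in pyplot until it is explicitly closed. With the cleanup lines from the comment above uncommented, the helper releases the memory:

import matplotlib.pyplot as plt

def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png')
    fig.clear() # Drop the artists held by the figure.
    plt.close(fig) # Remove the figure from pyplot's registry so it can be garbage collected.

A quick way to see the accumulation in the broken version is to print len(plt.get_fignums()) inside the loop: without plt.close it grows by one on every call.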

Just for completeness, the multiprocessing solution also works. The following script has no memory issues:

import pandas
import numpy
import datetime
import matplotlib.pyplot as plt
from multiprocessing import Process

def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png') # No explicit plt.close needed here: the child process exits right after saving.

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now()-last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        p = Process(target=save_matplotlib_plot, args=(df,)) # Do the plotting in a short-lived child process...
        p.start()
        p.join() # ...so that all the memory it uses is returned to the OS when it exits.
        df = df[0:0]
user171780