We have a massive pandas dataframe in our code, with shape (102730344, 50). To free up memory, we del this dataframe once it's no longer needed. That del statement currently takes 4 hours to run, even on powerful hardware. Is there a way to speed this up?

Here's the code flow:

big_data_df, small_df, medium_data, smaller_df = get_data(params)
#commented out code
del big_data_df # this takes 4 hours

So we call a function that returns 4 dataframes, one of which is the big dataframe we want to delete later. For testing, we've commented out the code between getting the dataframe and deleting it once it's no longer needed. The del then runs, and a logging statement following it shows a runtime of 4 hours.

Jeff Doing

  • What are you doing, **exactly**? Note, `del` by itself does not free memory. It removes a name, in the simplest case, `del some_name`. It is also part of `del some_container[item]`, where it simply calls `some_container.__delitem__(item)`. – juanpa.arrivillaga Oct 01 '20 at 19:41
  • Some related reading: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html – dfundako Oct 01 '20 at 19:42
  • What data types are in this `dataframe`? If it's `object`, then all of the individual objects need to be unreferenced and deleted. – tdelaney Oct 01 '20 at 19:44
  • Do answer @tdelaney's question; it's important. If the major type is `object`, then try to use a Python in the (at least) 3.8 series too, for reasons partly explained here: https://stackoverflow.com/questions/63348685/python-keeps-running-for-10mins-after-last-statement-in-program-when-there-is/63349560#63349560 – Tim Peters Oct 01 '20 at 19:54
  • Yes, we're reading the data in from SQL using `read_sql`, which is returning a dataframe in which most of the columns are objects. – Jeff Doing Oct 01 '20 at 20:06
  • Avoiding the use of `object` has a number of speed and memory use benefits. Can you replace them with floats, or categoricals? https://pythonspeed.com/articles/pandas-load-less-data/ If nothing else, can you replace them with fixed-width NumPy strings? https://stackoverflow.com/questions/34881079/pandas-distinction-between-str-and-object-types – Nick ODell Oct 01 '20 at 21:02
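To make the dtype advice in the comments concrete: converting `object` columns to native dtypes means teardown has almost no per-cell Python objects to release. A minimal sketch, assuming the object columns hold low-cardinality strings and the numeric columns can be safely downcast (the heuristics here are illustrative, not from the original code):

import pandas as pd

def shrink_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Low-cardinality object columns compress well as categoricals:
    # categorical storage keeps integer codes plus one copy of each
    # unique value, instead of one Python object per cell.
    for col in df.select_dtypes(include='object').columns:
        if df[col].nunique(dropna=True) < 0.5 * len(df):
            df[col] = df[col].astype('category')
    # Downcast numerics to the smallest width that fits the data.
    for col in df.select_dtypes(include='integer').columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include='float').columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df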

1 Answer

You could create the large dataframe in a subprocess, send only the data you need back to the parent, and then call os._exit() in the child to skip individual object cleanup. Whether this works for you depends on the relative size of the data being returned. In your case, the SQL query and the dataframe creation/processing could potentially be done in the subprocess. In this example I send the result on stdout, but it would also be reasonable to save to a temporary file instead. I'm using pickle, but other serializers such as pyarrow may be faster.

...and it may not work in your case at all.

dfuser.py

import sys
import subprocess as subp
import pandas as pd

try:
    # launch the child that builds the big dataframe
    proc = subp.Popen([sys.executable, 'dfcreator.py'], stdin=subp.PIPE, stdout=subp.PIPE, stderr=None)
    # unpickle the (smaller) result straight from the child's stdout
    df = pd.read_pickle(proc.stdout, compression=None)
    print("got df")
    # tell the child the data arrived so it can exit
    proc.stdin.write(b"thanks\n")
    proc.stdin.close()
    proc.wait()
    print(df)
finally:
    print('parent done')

dfcreator.py

import pandas as pd
import sys
import os

try:
    # add your df creation and processing here
    df = pd.util.testing.makeDataFrame()
    small_df = df # your processing makes it smaller
    # send
    small_df.to_pickle(sys.stdout.buffer, compression=None)
    sys.stdout.close()
    # make sure received
    sys.stdin.read(1)
finally:
    # exit without deleting df to save time
    sys.stderr.write("out of here\n")
    os._exit(0)
tdelaney
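
The answer above also mentions writing to a temporary file instead of a pipe. A sketch of that variant, assuming the same child-process pattern (the dfcreator_file.py script and its argument handling are hypothetical, not part of the original answer):

import os
import sys
import subprocess as subp
import tempfile
import pandas as pd

# create a temp path, let the child fill it, then read it back
with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as tmp:
    path = tmp.name
subp.run([sys.executable, 'dfcreator_file.py', path], check=True)
df = pd.read_pickle(path, compression=None)
os.unlink(path)
print(df)

dfcreator_file.py (hypothetical)

import os
import sys
import pandas as pd

# add your df creation and processing here
df = pd.util.testing.makeDataFrame()
small_df = df  # your processing makes it smaller
small_df.to_pickle(sys.argv[1], compression=None)  # path passed by the parent
os._exit(0)  # exit without deleting df, as in the answer above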