
I load a large DataFrame in Python 3 and take a small subset of it.

I would expect Python to remove the original large DataFrame object from memory, including its name, reference and value, but this is not happening: the process's memory usage does not decrease. Why?

This is a huge problem for me, and I do not know how to release the memory.

This is the output:

before 130564096
loading files
just before taking subset 3827941376
7258946128
56
after 3803156480

This is the code:

import gc
import os
import pickle
from os import path

import pandas as pd
import psutil
from pympler.asizeof import asizeof

process = psutil.Process(os.getpid())
basepath = os.getcwd()

print("before", process.memory_info().rss)

def load_files(file_name, file_ext):
    filename = "%s.%s" % (file_name, file_ext)
    filepath = path.abspath(path.join(basepath, "..", "..", "data_input", filename))

    # Load the large DataFrame from the pickle file.
    with open(filepath, 'rb') as pickle_load:
        df = pickle.load(pickle_load)

    print("just before taking subset", process.memory_info().rss)
    print(asizeof(df))

    # Keep only a small independent copy and drop the original.
    df2 = df[:100].copy(deep=True)
    del df

    gc.collect()

    # Rebind the name so nothing in this scope still references the big DataFrame.
    df = pd.DataFrame()
    df = ''

    gc.collect()

    print(asizeof(df))
    print("after", process.memory_info().rss)

    exit()  # for debugging only; the return below is never reached

    return df2
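
As a diagnostic along the lines suggested in the comments below, one can check whether the "missing" memory is actually reusable: load the same pickle a second time after `del df` and `gc.collect()` and compare RSS. If the second load does not roughly double RSS, the freed memory was kept by the Python/pandas allocators for reuse rather than leaked; it just was never returned to the OS. This is only a sketch (the `check_reuse` helper and its `filepath` argument are illustrative, not part of the original code):

import gc
import os
import pickle

import psutil

process = psutil.Process(os.getpid())

def check_reuse(filepath):
    # First load: RSS grows by roughly the size of the DataFrame.
    with open(filepath, "rb") as fh:
        df = pickle.load(fh)
    print("first load ", process.memory_info().rss)

    # Drop the only reference and force a collection.
    del df
    gc.collect()
    print("after del  ", process.memory_info().rss)  # often barely lower

    # Second load: if RSS ends up near the first peak instead of doubling,
    # the interpreter reused the memory freed above.
    with open(filepath, "rb") as fh:
        df = pickle.load(fh)
    print("second load", process.memory_info().rss)
    return df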
  • Possible duplicate of [How do I release memory used by a pandas dataframe?](https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe) – Georgy Apr 20 '18 at 10:49
  • I saw it before I posted my question, and I did not find the answer there. You can see that some of those suggestions are already implemented, but the problem still exists. Since I found no solution there, I decided to post a new question because I want to solve this. –  Apr 20 '18 at 11:04
  • As far as I know `gc` only kills reference cycles, so when you have only a handful of single references those should already be killed automatically when their reference count goes to 0. You aren't using a heavy-weight interactive shell such as ipython/jupyter, are you? – Andras Deak -- Слава Україні Apr 20 '18 at 11:21
  • No, I am not. I kick off the code from the terminal. –  Apr 20 '18 at 13:01
  • If the memory is freed but not returned to the OS, will it not show up in `memory_info().rss`? I'm not familiar with the subject but I find it conceivable that the memory could be reused by python, it's just not given back to the OS. Could you somehow test whether allocating something large increases the allocated memory once again, or perhaps it reuses what was left after `del df`? – Andras Deak -- Слава Україні Apr 20 '18 at 14:00
  • I am not sure about that. I use `@profile` from `memory_profiler` to wrap the function and it returns similar results. I also observe that the memory is not shrinking, which tells me the value of the original `df` stays in memory. –  Apr 20 '18 at 14:46
  • If all else fails you could try the second-top answer on the suggested duplicate: delegating the filtering step to a separate process via `multiprocessing`. – Andras Deak -- Слава Україні Apr 20 '18 at 15:01
  • Did you find any solution to this? – Sreenath Nov 24 '20 at 14:09
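
For reference, a minimal sketch of the `multiprocessing` workaround suggested in the comment above (the helper names and the `n_rows` parameter are illustrative, not from the question): do the load-and-filter step in a worker process, so the large intermediate DataFrame lives and dies there and its memory is returned to the OS when the worker exits.

import pickle
from multiprocessing import Pool

def load_subset(filepath, n_rows=100):
    # Runs in the worker process: load the big pickle, return only a small copy.
    with open(filepath, "rb") as fh:
        df = pickle.load(fh)
    return df[:n_rows].copy(deep=True)

def load_subset_in_subprocess(filepath, n_rows=100):
    # maxtasksperchild=1 tears the worker down after one task,
    # taking the large DataFrame's memory with it.
    with Pool(processes=1, maxtasksperchild=1) as pool:
        return pool.apply(load_subset, (filepath, n_rows))

Only the small returned subset is pickled back to the parent process, so the parent's RSS never has to hold the full DataFrame.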

0 Answers