Introduction

Hello Fellow Internauts!

I am encountering strange behavior when working with three popular Python libraries: pandas, NumPy, and multiprocessing. Whenever I put a pandas DataFrame (containing NumPy arrays) into a multiprocessing queue, the reported memory usage of the DataFrame changes. This is visible by checking the DataFrame's memory usage before putting it into the queue and after retrieving it.

System and Environment

  • Ubuntu 20.04
  • python==3.8.5
  • numpy==1.21.4
  • pandas==1.3.4

Bug Replication

Here is the Python interpreter session that showcases the issue:

First, I imported the involved libraries:

>>> import numpy as np
>>> import pandas as pd
>>> import multiprocessing as mp

Then, I created a basic array and a pandas DataFrame holding three references to it. The memory usage thus far makes sense.

>>> array = np.ones([720, 1280, 3], dtype=np.uint8)
>>> df = pd.DataFrame({'frames': [array, array, array]})
>>> array.nbytes
2764800
>>> df.memory_usage(deep=True).sum()
8294960
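
(For what it's worth, the DataFrame figure checks out: it is three times the array's data size plus a small amount of index and per-cell overhead.)

>>> 3 * array.nbytes
8294400

The remaining 560 bytes are index and object-header overhead.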

Where things become unclear to me is after placing the pandas DataFrame into the queue and retrieving it. The reported memory drops significantly! Where does it go?!

>>> queue = mp.Queue(maxsize=5)
>>> queue.put(df)
>>> new_df = queue.get()
>>> new_df.memory_usage(deep=True).sum()
560
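
(A sanity check, added for clarity: the underlying array data actually survives the round trip; only the reported usage changes. This is confirmed for plain dicts in EDIT #1 below.)

>>> new_df['frames'][0].nbytes
2764800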

Initial Attempts

Does it affect np.ndarrays as well? (Nope)

>>> queue.put(array)
>>> new_array = queue.get()
>>> new_array.nbytes
2764800
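
(In hindsight, this check is misleading: .nbytes is computed from the shape and dtype, so it can never shrink. Checking sys.getsizeof instead tells a different story; the output below is what EDIT #1 later confirms for the same array through the same queue.)

>>> import sys
>>> sys.getsizeof(new_array)
144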

Perhaps I should copy the DataFrame before putting it into the queue? (Nope)

>>> queue.put(df.copy())
>>> new_df = queue.get()
>>> new_df.memory_usage(deep=True).sum()
560

Is this present in other Queue implementations? (Nope)

>>> from queue import Queue
>>> new_queue = Queue(maxsize=5)
>>> new_queue.put(df)
>>> new_df = new_queue.get()
>>> new_df.memory_usage(deep=True).sum()
8294960
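
(Worth noting: queue.Queue never leaves the process and never pickles anything; it hands back a reference to the very same object, which a quick identity check confirms.)

>>> new_df is df
True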

Conclusion

As much as I want to just avoid this issue by using queue.Queue, I am developing a multiprocessing program, and queue.Queue only communicates between threads within a single process, so it won't work for me.

Any help would be greatly appreciated and I am sorry if this is a silly mistake that is noted in the multiprocessing documentation.

EDIT #1 (03/13/2022 at 1:35 pm CST):

After further testing, I determined that this issue comes from the NumPy arrays themselves, not from pandas. Here is an example:

>>> import sys
>>> array = np.ones([720, 1280, 3], dtype=np.uint8)
>>> array_dict = {'frames': [array, array, array]}
>>> array.nbytes
2764800
>>> queue = mp.Queue(maxsize=5)
>>> queue.put(array_dict)
>>> new_array_dict = queue.get()
>>> sys.getsizeof(array_dict['frames'][0])
2764944
>>> sys.getsizeof(new_array_dict['frames'][0])
144
>>> array_dict['frames'][0].nbytes
2764800
>>> new_array_dict['frames'][0].nbytes
2764800
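
(Two details stand out here. Before the queue, sys.getsizeof reports the 2764800 data bytes plus a 144-byte ndarray header: 2764800 + 144 = 2764944. Afterwards, it reports only the header. My hypothesis, consistent with the 144 figure, is that the unpickled array no longer owns its buffer, and the OWNDATA flag supports this:)

>>> array_dict['frames'][0].flags['OWNDATA']
True
>>> new_array_dict['frames'][0].flags['OWNDATA']
False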

Not sure how to fix this issue yet, but the pd.DataFrame memory usage report is incorrect: memory_usage(deep=True) relies on sys.getsizeof, which is obtaining the wrong memory usage of the NumPy arrays after they pass through the queue. I will continue debugging until the culprit is found!


1 Answer


Based on the discussion in this SO thread, I noticed that the issue is the combination of pickling (which is done by multiprocessing.Queue internally) of NumPy arrays and the sys.getsizeof function: as far as I can tell, NumPy counts the data buffer in sys.getsizeof only when the array owns its memory, and an array reconstructed by unpickling does not, so only the small object header is counted. Here is an example:

>>> import numpy as np
>>> import pickle
>>> import sys
>>> a = np.ones([720, 1280, 3])  # float64 this time, so the data is 22118400 bytes
>>> with open('test.pkl', 'wb') as f:
...     pickle.dump(a, f)
...
>>> with open('test.pkl', 'rb') as f:
...     new_a = pickle.load(f)
...
>>> sys.getsizeof(a)
22118544
>>> sys.getsizeof(new_a)
144
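
This also explains the mysterious 560 from the original session: after the round trip, memory_usage(deep=True) sees only three 144-byte array headers plus the 128-byte RangeIndex.

>>> 3 * 144 + 128
560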

One solution, not perfect but sufficient, comes from the linked SO thread: measure the serialized size with len(pickle.dumps(...)). It works for the initial use-case as well. Note that all three rows of df reference the same array object and pickle memoizes repeated references, so the 22119157-byte figure below is roughly one float64 array (22118400 bytes) plus overhead, not three.

>>> import pandas as pd
>>> import multiprocessing as mp
>>> df = pd.DataFrame({'frames': [a, a, a]})
>>> q = mp.Queue(maxsize=5)
>>> q.put(df)
>>> new_df = q.get()
>>> len(pickle.dumps(new_df))
22119157
>>> len(pickle.dumps(df))
22119157
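
If you want an actual memory report rather than a serialized size, another option is to bypass sys.getsizeof for array cells and trust ndarray.nbytes instead. Below is my own sketch (df_deep_nbytes is a hypothetical helper, not a pandas API). I have not re-run it in the environment above, but based on the earlier figures it should report 128 + 3 * 22118400 = 66355328 for both df and new_df, since nbytes is unaffected by pickling:

>>> def df_deep_nbytes(frame):
...     # Start with the index, which memory_usage reports correctly.
...     total = frame.index.memory_usage(deep=True)
...     for col in frame.columns:
...         s = frame[col]
...         if s.dtype == object:
...             # Count each cell: nbytes for arrays, sys.getsizeof otherwise.
...             total += sum(v.nbytes if isinstance(v, np.ndarray)
...                          else sys.getsizeof(v) for v in s)
...         else:
...             total += s.memory_usage(index=False, deep=True)
...     return total
...
>>> df_deep_nbytes(df)
66355328
>>> df_deep_nbytes(new_df)
66355328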

If anyone has a better solution, I would greatly appreciate it and gladly accept it. Thank you!
