Introduction
Hello Fellow Internauts!
I am encountering strange behavior when combining pandas, NumPy, and the multiprocessing module. Whenever I put a pandas DataFrame (containing NumPy arrays) into a multiprocessing queue, the reported memory usage of the DataFrame changes. This is visible by comparing the DataFrame's memory usage before putting it into the queue and after retrieving it.
System and Environment
- Ubuntu 20.04
- python==3.8.5
- numpy==1.21.4
- pandas==1.3.4
Bug Replication
Here is the Python interpreter session that showcases the issue:
First, I imported the involved libraries:
>>> import numpy as np
>>> import pandas as pd
>>> import multiprocessing as mp
Then, I created a basic array and a pandas DataFrame holding three references to that array. The memory usage thus far makes sense.
>>> array = np.ones([720, 1280, 3], dtype=np.uint8)
>>> df = pd.DataFrame({'frames': [array, array, array]})
>>> array.nbytes
2764800
>>> df.memory_usage(deep=True).sum()
8294960
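As a sanity check on the 8,294,960 figure: my understanding (a guess, not verified against pandas internals) is that memory_usage(deep=True) sums sys.getsizeof over each element of an object-dtype column and adds the index, which lines up with the numbers above:

import sys
# Assumption: deep=True adds sys.getsizeof of every object-column element
# to the index size; array and df are reused from the session above.
per_element = sys.getsizeof(array)      # 2764944: 2764800 of data + 144 of header
index_bytes = df.index.memory_usage()   # 128 for this RangeIndex on my setup
print(3 * per_element + index_bytes)    # 8294960, matching the report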
Where things stop making sense is after putting the DataFrame into the queue and retrieving it: the reported memory shrinks dramatically! Where does it go?!
>>> queue = mp.Queue(maxsize=5)
>>> queue.put(df)
>>> new_df = queue.get()
>>> new_df.memory_usage(deep=True).sum()
560
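Before trying workarounds, I checked whether the underlying pixel data actually survives the round trip; as far as I can tell it does, and only the report shrinks:

# Reusing df and new_df from above: the data itself appears intact.
old = df['frames'][0]
new = new_df['frames'][0]
print(new.nbytes)                # 2764800: the logical data size is unchanged
print(np.array_equal(old, new))  # True: the contents match element-wise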
Initial Attempts
Does it affect np.ndarrays as well? (Nope)
>>> queue.put(array)
>>> new_array = queue.get()
>>> new_array.nbytes
2764800
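In hindsight, nbytes may be the wrong probe here: as far as I know it is computed from the shape and itemsize metadata rather than by inspecting the buffer, so it would read the same either way.

# nbytes is just the product of the shape times the itemsize, so it cannot
# distinguish an owned buffer from a borrowed one.
h, w, c = new_array.shape
print(h * w * c * new_array.itemsize)  # 720 * 1280 * 3 * 1 = 2764800
print(new_array.nbytes)                # the same value, by construction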
Perhaps I should copy the DataFrame before putting it into the queue? (Nope)
>>> queue.put(df.copy())
>>> new_df = queue.get()
>>> new_df.memory_usage(deep=True).sum()
560
Is this present in other Queue implementations? (Nope)
>>> from queue import Queue
>>> new_queue = Queue(maxsize=5)
>>> new_queue.put(df)
>>> new_df = new_queue.get()
>>> new_df.memory_usage(deep=True).sum()
8294960
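My guess for why queue.Queue behaves differently (not confirmed): it lives entirely in one process and hands back a reference to the very same object, with no pickling involved, while multiprocessing.Queue has to serialize its items.

# queue.Queue stores plain references, so there is no serialization round trip.
same_queue = Queue(maxsize=5)
same_queue.put(df)
print(same_queue.get() is df)  # True: the identical object comes back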
Conclusion
As much as I want to just avoid this issue by using queue.Queue, I am developing a multiprocessing program that won't work with queue.Queue.
Any help would be greatly appreciated, and I am sorry if this is a silly mistake that is already covered in the multiprocessing documentation.
EDIT #1 (03/13/2022 at 1:35 pm CST):
After further testing, I determined that the issue comes from the NumPy arrays themselves; pandas is not required to reproduce it. Here is a minimal example:
>>> import sys
>>> array = np.ones([720, 1280, 3], dtype=np.uint8)
>>> array_dict = {'frames': [array, array, array]}
>>> array.nbytes
2764800
>>> queue = mp.Queue(maxsize=5)
>>> queue.put(array_dict)
>>> new_array_dict = queue.get()
>>> sys.getsizeof(array_dict['frames'][0])
2764944
>>> sys.getsizeof(new_array_dict['frames'][0])
144
>>> array_dict['frames'][0].nbytes
2764800
>>> new_array_dict['frames'][0].nbytes
2764800
Not sure how to fix this yet, but the pd.DataFrame memory-usage report is misleading because sys.getsizeof is returning the wrong size for the NumPy arrays that come out of the queue: 144 is exactly 2,764,944 − 2,764,800, i.e., the ndarray header without its data buffer. I will continue debugging until the culprit is found!
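One diagnostic I still want to run, based on a hunch: if NumPy's __sizeof__ only counts the data buffer when the array owns it, and unpickling hands back an array viewing some other object's bytes, then sys.getsizeof would report just the 144-byte header. A sketch of that check, assuming my reading of the flags is right:

# Hypothesis (unconfirmed): the array from the queue no longer owns its
# buffer, so sys.getsizeof counts only the ndarray header (144 bytes here).
old = array_dict['frames'][0]
new = new_array_dict['frames'][0]
print(old.flags['OWNDATA'])  # True for the original array
print(new.flags['OWNDATA'])  # expecting False if the hunch is right
print(type(new.base))        # whichever object actually holds the pixel bytes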