
I have a piece of code that receives a callback from another function and builds a list of lists (`pd_arr`). This list is then used to create a DataFrame. Finally, the list of lists is deleted.

On profiling with memory-profiler, this is the output:

102.632812 MiB   0.000000 MiB       init()
236.765625 MiB 134.132812 MiB           add_to_list()
                                    return pd.DataFrame()
394.328125 MiB 157.562500 MiB       pd_df = pd.DataFrame(pd_arr, columns=df_columns)
350.121094 MiB -44.207031 MiB       pd_df = pd_df.set_index(df_columns[0])
350.292969 MiB   0.171875 MiB       pd_df.memory_usage()
350.328125 MiB   0.035156 MiB       print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0]), sys.getsizeof(pd_df), len(pd_arr)
350.328125 MiB   0.000000 MiB       del pd_arr

On checking the deep memory usage of pd_df (the DataFrame), it is 80.5 MB. So my question is: why does the memory not decrease after the `del pd_arr` line?

Also, the total DataFrame size according to the profiler (157 - 44 = 113 MB) seems to be more than the 80 MB reported. What causes the difference?

Also, is there a more memory-efficient way to create the DataFrame (the data is received in a loop) whose time performance is not too bad? For example, an overhead of a few tens of seconds would be fine for a DataFrame of around 100 MB.

Edit: a simple Python script that demonstrates this behaviour:

Filename: py_test.py

Line #    Mem usage    Increment   Line Contents
================================================
     9    102.0 MiB      0.0 MiB   @profile
    10                             def setup():
    11                              global arr, size
    12    102.0 MiB      0.0 MiB    arr = range(1, size)
    13    131.2 MiB     29.1 MiB    arr = [x+1 for x in arr]


Filename: py_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    21    131.2 MiB      0.0 MiB   @profile
    22                             def tearDown():
    23                              global arr
    24    131.2 MiB      0.0 MiB    del arr[:]
    25    131.2 MiB      0.0 MiB    del arr
    26     93.7 MiB    -37.4 MiB    gc.collect()

After introducing the DataFrame:

Filename: py_test.py

Line #    Mem usage    Increment   Line Contents
================================================
     9    102.0 MiB      0.0 MiB   @profile
    10                             def setup():
    11                              global arr, size
    12    102.0 MiB      0.0 MiB    arr = range(1, size)
    13    132.7 MiB     30.7 MiB    arr = [x+1 for x in arr]


Filename: py_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    15    132.7 MiB      0.0 MiB   @profile
    16                             def dfCreate():
    17                              global arr
    18    147.1 MiB     14.4 MiB    pd_df = pd.DataFrame(arr)
    19    147.1 MiB      0.0 MiB    return pd_df


Filename: py_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    21    147.1 MiB      0.0 MiB   @profile
    22                             def tearDown():
    23                              global arr
    24                              #del arr[:]
    25    147.1 MiB      0.0 MiB    del arr
    26    147.1 MiB      0.0 MiB    gc.collect()
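For reference, here is py_test.py reconstructed from the Line Contents columns above (the value of `size` and the driver block at the bottom are my assumptions; in the first pair of profiles, dfCreate() was not called and `del arr[:]` was uncommented):

# py_test.py -- reconstructed from the profiles above
import gc
import pandas as pd
from memory_profiler import profile

size = 1000 * 1000  # assumed: any value large enough to show up in the profile

@profile
def setup():
    global arr, size
    arr = range(1, size)
    arr = [x+1 for x in arr]

@profile
def dfCreate():
    global arr
    pd_df = pd.DataFrame(arr)
    return pd_df

@profile
def tearDown():
    global arr
    #del arr[:]
    del arr
    gc.collect()

if __name__ == '__main__':
    setup()
    pd_df = dfCreate()
    tearDown()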
– Rajs123
    Are you completely sure there is not a reference to `pd_arr` anywhere else in the code? Python is reference-counted, so using `del` will only free the associated memory if it can be ensured that the deleted object cannot be used from anywhere. You can also try to [clear the list](https://stackoverflow.com/questions/1400608/how-to-empty-a-list-in-python). – jdehesa Jan 27 '17 at 12:29
  • I tried using `del pd_arr[:]`. No memory is reduced. pd_arr is defined as global in the code. Would that make a difference? – Rajs123 Jan 27 '17 at 13:02
  • Well `del pd_arr` just means that you cannot use the name `pd_arr` to refer to that list anymore, whether global or not, but if at some previous point there was something like `a = pd_arr` (although it could be something much more subtle, like passing `pd_arr` to a function and having its reference copied somewhere else), then it will not be really deleted. However, I cannot explain why `del pd_arr[:]` does not make any difference. – jdehesa Jan 27 '17 at 13:08
  • There is no other reference pointer to pd_arr as such. – Rajs123 Jan 27 '17 at 13:19

1 Answer


Answering your first question: when you try to free memory with `del pd_arr`, nothing actually happens, because the DataFrame stores one reference to pd_arr and the top scope keeps one more; decrementing the reference counter does not reclaim the memory while the object is still in use.

You can check this assumption by running `sys.getrefcount(pd_arr)` before `del pd_arr`; you will get 2 as the result.
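As a quick illustration of how `sys.getrefcount` counts references (the call itself adds one temporary reference, so a single binding already reports 2):

import sys

a = [1, 2, 3]
print sys.getrefcount(a)  # 2: the name `a` plus the temporary argument reference
b = a
print sys.getrefcount(a)  # 3: `b` holds one more reference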

Now, I believe the following code snippet does the same as what you're trying to do: https://gist.github.com/vladignatyev/ec7a26b7042efd6f710d436afbfb87de/90df8cc6bbb8bd0cb3a1d2670e03aff24f3a5b24
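In case the gist link goes stale, the snippet is roughly the following reconstruction (make_list is identical to snippet #2 shown further down; the bare to_profile() call at the bottom is my assumption):

import gc
import sys
import pandas as pd
from memory_profiler import profile

def make_list():
    pd_arr = []
    for i in range(0, 10000):
        pd_arr.append([x for x in range(0, 1000)])
    return pd_arr

@profile
def to_profile():
    pd_arr = make_list()
    # pd_df = pd.DataFrame.from_records(pd_arr, columns=[x for x in range(0,1000)])
    pd_df = pd.DataFrame(pd_arr)
    # pd_df.info(memory_usage='deep')
    print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
    print sys.getsizeof(pd_df), len(pd_arr)
    print sys.getrefcount(pd_arr)
    del pd_arr
    gc.collect()

to_profile()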

If you try this snippet, you will see the memory usage as follows:

Line #    Mem usage    Increment   Line Contents
================================================
    13   63.902 MiB    0.000 MiB   @profile
    14                             def to_profile():
    15  324.828 MiB  260.926 MiB       pd_arr = make_list()
    16                                 # pd_df = pd.DataFrame.from_records(pd_arr, columns=[x for x in range(0,1000)])
    17  479.094 MiB  154.266 MiB       pd_df = pd.DataFrame(pd_arr)
    18                                 # pd_df.info(memory_usage='deep')
    19  479.094 MiB    0.000 MiB       print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
    20  481.055 MiB    1.961 MiB       print sys.getsizeof(pd_df), len(pd_arr)
    21  481.055 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
    22  417.090 MiB  -63.965 MiB       del pd_arr
    23  323.090 MiB  -94.000 MiB       gc.collect()

Try this example:

import gc
from memory_profiler import profile

@profile
def test():
    a = [x for x in range(0, 100000)]
    # print sys.getrefcount(a)
    del a
    gc.collect()


test()

You will get exactly what you expect:

Line #    Mem usage    Increment   Line Contents
================================================
     6   64.117 MiB    0.000 MiB   @profile
     7                             def test():
     8   65.270 MiB    1.152 MiB       a = [x for x in range(0,100000)]
     9                                 # print sys.getrefcount(a)
    10   64.133 MiB   -1.137 MiB       del a
    11   64.133 MiB    0.000 MiB       gc.collect()

Also, if you call `sys.getrefcount(a)`, the memory is sometimes reclaimed even before the `del a` line:

Line #    Mem usage    Increment   Line Contents
================================================
     6   63.828 MiB    0.000 MiB   @profile
     7                             def test():
     8   65.297 MiB    1.469 MiB       a = [x for x in range(0,100000)]
     9   64.230 MiB   -1.066 MiB       print sys.getrefcount(a)
    10   64.160 MiB   -0.070 MiB       del a

But things go wild when you use pandas.

If you open the source code of pandas.DataFrame, you will see that when you initialize a DataFrame with a list, pandas creates a new NumPy array and copies its contents. Check this out: https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L329

Deleting pd_arr doesn't free the memory on the spot; since there are no additional references to it, pd_arr would be collected after the DataFrame creation and on exiting your function anyway. The getrefcount calls before and after prove this.

Creating a new DataFrame from a plain list makes your list get copied into a NumPy array. (Look at `np.array(data, dtype=dtype, copy=copy)` and the corresponding documentation on `numpy.array`.) The copy may affect execution time, because allocating a new memory block is an expensive operation.
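A small sketch of the underlying NumPy behaviour (nothing pandas-specific is assumed here): building an array from a list always allocates and copies, while an existing ndarray of a matching dtype can be passed through unchanged:

import numpy as np

lst = [[1, 2], [3, 4]]
arr_from_list = np.array(lst)  # a list is always copied into a freshly allocated buffer
arr_from_arr = np.array(arr_from_list, copy=False)  # a matching ndarray is reused as-is
print arr_from_arr is arr_from_list  # True: no copy was made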

I tried initializing the new DataFrame from a NumPy array instead. The only difference is where the NumPy array's memory overhead appears. Compare the following two snippets:

def make_list():  # 1
    pd_arr = []
    for i in range(0,10000):
        pd_arr.append([x for x in range(0,1000)])
    return np.array(pd_arr)

and

def make_list():  #2
    pd_arr = []
    for i in range(0,10000):
        pd_arr.append([x for x in range(0,1000)])
    return pd_arr

Number #1 (creating the DataFrame produces no memory-usage overhead!):

Line #    Mem usage    Increment   Line Contents
================================================
    14   63.672 MiB    0.000 MiB   @profile
    15                             def to_profile():
    16  385.309 MiB  321.637 MiB       pd_arr = make_list()
    17  385.309 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
    18  385.316 MiB    0.008 MiB       pd_df = pd.DataFrame(pd_arr)
    19  385.316 MiB    0.000 MiB       print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
    20  386.934 MiB    1.617 MiB       print sys.getsizeof(pd_df), len(pd_arr)
    21  386.934 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
    22  386.934 MiB    0.000 MiB       del pd_arr
    23  305.934 MiB  -81.000 MiB       gc.collect()

Number #2 (about 154 MiB of overhead due to copying the array into the DataFrame):

Line #    Mem usage    Increment   Line Contents
================================================
    14   63.652 MiB    0.000 MiB   @profile
    15                             def to_profile():
    16  325.352 MiB  261.699 MiB       pd_arr = make_list()
    17  325.352 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
    18  479.633 MiB  154.281 MiB       pd_df = pd.DataFrame(pd_arr)
    19  479.633 MiB    0.000 MiB       print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
    20  481.602 MiB    1.969 MiB       print sys.getsizeof(pd_df), len(pd_arr)
    21  481.602 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
    22  417.621 MiB  -63.980 MiB       del pd_arr
    23  330.621 MiB  -87.000 MiB       gc.collect()

So, initialize the DataFrame from a NumPy array rather than from a list. It is better from the memory-consumption perspective and probably faster, because it doesn't require an additional allocation-and-copy step.
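You can see the no-copy behaviour directly (this holds for the pandas of that era with the default `copy=False`; pandas versions with copy-on-write enabled behave differently):

import numpy as np
import pandas as pd

a = np.zeros((3, 3))
df = pd.DataFrame(a)   # homogeneous 2-D ndarray: wrapped without copying
df.iloc[0, 0] = 42
print a[0, 0]  # 42.0 -- the DataFrame and the array share one buffer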

Hopefully, now I've answered all of your questions.

– Vladimir Ignatev
  • I tried to compute ref counts before and after DataFrame creation. [Code-link](https://gist.github.com/Rajlaxmi/eae3708fc1fd0d6cca24d6003be61539). Ref count was 2 before and after DataFrame creation. – Rajs123 Feb 02 '17 at 11:35
  • I see the same. I'm trying to understand what's wrong. – Vladimir Ignatev Feb 02 '17 at 12:39
  • It's okay that the ref count is 2: `import sys; a = [1,2,3]; print sys.getrefcount(a)` prints `2`. – Vladimir Ignatev Feb 02 '17 at 12:43
  • @Rajs123 please check my answer! You should prefer `np.array` over a list during DataFrame creation, because it's faster and doesn't require the additional data copying that occurs in `pandas` internals (proof is in the answer). – Vladimir Ignatev Feb 02 '17 at 13:17
  • Thanks a lot! This was everything I needed. Also, thanks for pointing out the code. Will be using it a lot now :) – Rajs123 Feb 02 '17 at 18:47