
I'm trying to work out how to free memory by dropping columns.

import numpy as np
import pandas as pd

big_df = pd.DataFrame(np.random.randn(100000,20))
big_df.memory_usage().sum()

> 16000128

Now there are various ways of getting a subset of the columns copied into a new dataframe. Let's look at the memory usage of a few of them.

small_df = big_df[[0, 1]]
small_df.memory_usage().sum()

> 1600128

small_df_filtered = big_df.filter([0, 1], axis='columns')
small_df_filtered.memory_usage().sum()

> 1600128

small_df_copied = big_df[[0, 1]].copy()
small_df_copied.memory_usage().sum()

> 1600128

small_df_dropped = big_df.drop([0, 1], axis='columns')
small_df_dropped.memory_usage().sum()

> 14400128

small_df_dropped = big_df.drop([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], axis='columns')
small_df_dropped.memory_usage().sum()

> 1600128

Adding deep=True to memory_usage() does not change the results.
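For example (deep=True should only matter for object-dtype columns, such as strings, where pandas has to measure the referenced Python objects; all of these columns are float64):

small_df.memory_usage(deep=True).sum()

> 1600128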

Adding del big_df after making the copies does not change the results.

So none of these smaller copies occupies less memory than the original dataframe, and, even stranger, dropping 18 columns keeps the memory the same while dropping 2 increases it.

What is going on? Are these really copies? If so why aren't they smaller than the original?
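One way to probe whether the data is really duplicated is np.shares_memory (a rough sketch; the result can differ between pandas versions, especially once copy-on-write is enabled, where a selection may share its buffer until it is written to):

# Compare the buffer behind column 0 in each frame. False suggests
# the subset owns its own copy of the data; True means it is still
# pointing into big_df's buffer.
np.shares_memory(small_df[0].to_numpy(), big_df[0].to_numpy())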

dumbledad

2 Answers


From this other question:

Memory is not released when taking a small slice of a DataFrame

"As @Alex noted, slicing a dataframe only gives you a view to the original frame, but does not delete it; you need to use .copy() for that. However, even when I used .copy(), memory usage grew and grew and grew, albeit at a slower rate.

I suspect that this has to do with how Python, numpy and pandas use memory. A dataframe is not a single object in memory; it contains pointers to other objects (especially, in this particular case, to strings, which is the "flags" column). When the dataframe is freed, and these objects are freed, the reclaimed free memory space can be fragmented. Later, when a huge new dataframe is created, it might not be able to use the fragmented space, and new space might need to be allocated. The details depend on many little things, such as the Python, numpy and pandas versions, and the particulars of each case."
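A minimal sketch of the release pattern the quoted answer describes (copy the small subset, then drop the big frame), assuming psutil is installed to read the process's resident memory; the OS-level number is only indicative, since the allocator may hold on to freed pages:

import gc
import psutil

proc = psutil.Process()
print(proc.memory_info().rss)      # resident memory (bytes) before

small_df = big_df[[0, 1]].copy()   # decouple the subset from big_df's buffers
del big_df                         # drop the last reference to the big frame
gc.collect()                       # collect any lingering cycles so the buffers can be freed

print(proc.memory_info().rss)      # resident memory (bytes) after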

Guinther Kovalski

Original DataFrame

big_df = pd.DataFrame(np.random.randn(100000,20))
big_df.memory_usage().sum()

Memory Usage: 16000128 i.e. 1.6x10^7

On slice,

small_df = big_df[[0, 1]]
small_df.memory_usage().sum()

Memory Usage: 1600128 i.e. 1.6x10^6

There is one less digit in the slice. On copy, the size remains the same as the sliced dataframe.

On dropping columns 0 and 1, the other 18 columns remain, so the memory usage is far higher than the two-column slice:

small_df_dropped = big_df.drop([0, 1], axis='columns')
small_df_dropped.memory_usage().sum()

Memory Usage: 14400128 i.e. 1.4x10^7
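As a sanity check, all three numbers follow directly from the shapes: each float64 value takes 8 bytes, and the extra 128 bytes in every report is what memory_usage charges for the RangeIndex:

rows = 100000
index_bytes = 128   # RangeIndex overhead seen in every report above

for n_cols in (20, 2, 18):
    print(n_cols, rows * n_cols * 8 + index_bytes)

> 20 16000128
> 2 1600128
> 18 14400128

So the slices and copies really are smaller than the original; the leading digits just make 1600128 easy to misread as 16000128.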

Vishnudev Krishnadas