I'm trying to work out how to free memory by dropping columns.
import numpy as np
import pandas as pd
big_df = pd.DataFrame(np.random.randn(100000,20))
big_df.memory_usage().sum()
> 16000128
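That matches what I'd expect: 100,000 rows × 20 float64 columns × 8 bytes each, plus what I assume is 128 bytes for the RangeIndex:
100000 * 20 * 8 + 128
> 16000128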
Now there are various ways of getting a subset of the columns copied into a new dataframe. Let's look at the memory usage of a few of them.
small_df = big_df[[0, 1]]
small_df.memory_usage().sum()
> 1600128
small_df_filtered = big_df.filter([0, 1], axis='columns')
small_df_filtered.memory_usage().sum()
> 1600128
small_df_copied = big_df[[0, 1]].copy()
small_df_copied.memory_usage().sum()
> 1600128
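To try to tell whether these 2-column frames are views or real copies, I also checked whether their data still shares memory with big_df. I'm not sure this is the right test, and I'd guess the answer depends on the pandas version:
# True would mean the column is still backed by big_df's data
np.shares_memory(np.asarray(small_df[0]), np.asarray(big_df[0]))
np.shares_memory(np.asarray(small_df_filtered[0]), np.asarray(big_df[0]))
np.shares_memory(np.asarray(small_df_copied[0]), np.asarray(big_df[0]))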
small_df_dropped = big_df.drop([0, 1], axis='columns')
small_df_dropped.memory_usage().sum()
> 14400128
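That number makes sense on its own: dropping 2 of the 20 columns leaves 18 columns of float64 data:
18 * 100000 * 8 + 128
> 14400128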
small_df_dropped = big_df.drop([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], axis='columns')
small_df_dropped.memory_usage().sum()
> 1600128
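And that again matches the 2 remaining columns (2 * 100000 * 8 + 128). Since whether the dropped frame still points at big_df's data seems to be the crux of actually freeing memory, I ran the same shares-memory check on it too (again, I'm not sure this is the right test):
np.shares_memory(np.asarray(small_df_dropped[0]), np.asarray(big_df[0]))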
Adding deep=True does not change the results.
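That is, for example:
small_df.memory_usage(deep=True).sum()
> 1600128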
Adding del big_df after making the copies does not change the results.
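In other words, roughly this:
del big_df
small_df.memory_usage().sum()
> 1600128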
So none of these approaches seems to actually free any memory compared with keeping the original dataframe, and, even stranger, dropping 18 columns reports the same 1,600,128 bytes as the two-column selections, while dropping just 2 columns reports 14,400,128 bytes.
What is going on? Are these really copies? If so, why don't they end up smaller than the original?
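For what it's worth, the measurement I ultimately care about is the real memory used by the Python process, which I'd check with something like this (using psutil here just as an example; maybe there's a better way):
import os, psutil
# resident set size of the current process, in bytes
psutil.Process(os.getpid()).memory_info().rss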