
What is the best way to delete a column without running out of memory in pandas?

I have a large dataset, and after some variable manipulation I need to delete about half the variables. I tried using df.drop(vars, axis=1, inplace=True) but discovered that my memory usage shot up quite a bit. The same happens without the inplace parameter.

This is the exact topic discussed in this old pandas issue thread but it was closed without giving an answer. There are many similar questions on SO but I have not found an answer to this, which is specifically how to avoid a large memory increase when dropping many variables from a large data frame. Thanks!

Mauricio

1 Answer


As already advised in the question mentioned by @Kraigolas, using inplace is not recommended for various reasons, and in this case it does not bring any benefit anyway.

In general, the drop operation can be memory-hungry if the dataframe has not first been brought into the most compact form possible.

Cast dtypes

For example, you can cast each column to the smallest suitable dtype in order to save space (see pandas.DataFrame.dtypes and pandas.DataFrame.astype).

An example using Python 3.9, pandas 1.4.3 and NumPy 1.23.1, measured with tracemalloc:

import pandas as pd
import numpy as np
import tracemalloc

df = pd.DataFrame(data=np.ones((10000,10000)))

tracemalloc.start()

df.drop(df.columns[0:1000], axis=1)

print(f"MB peak of RAM: {tracemalloc.get_traced_memory()[1] / 1024 / 1024}")

tracemalloc.stop()

Output will be 687.13 MB.

Now, if you cast the dtypes to int8 instead of the default float64 (moving tracemalloc's start after the preprocessing step, otherwise the peak is inflated by the cast itself), in this way:

df = pd.DataFrame(data=np.ones((10000,10000)))
df = df.astype('int8')

tracemalloc.start()

df.drop(df.columns[0:1000], axis=1)

print(f"MB peak of RAM: {tracemalloc.get_traced_memory()[1] / 1024 / 1024}")

tracemalloc.stop()

Output will be 86.31 MB.
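A quick way to verify the saving independently of tracemalloc is pandas.DataFrame.memory_usage; a small sketch on a smaller frame (the sizes here are illustrative, not the ones from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.ones((1000, 1000)))  # smaller frame, same idea

# Total footprint before and after the cast, in MB
before_mb = df.memory_usage(deep=True).sum() / 1024 / 1024
after_mb = df.astype('int8').memory_usage(deep=True).sum() / 1024 / 1024

print(f"float64: {before_mb:.2f} MB, int8: {after_mb:.2f} MB")
```

Since int8 takes 1 byte per value instead of float64's 8, the footprint shrinks by roughly a factor of 8.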

Use 'iloc' (if possible)

If you can locate the columns by an index range (e.g. after reordering them with pandas.DataFrame.sort_values according to some criterion), pandas.DataFrame.iloc will be considerably more memory-efficient and faster.

Compared to the example above, still using the int8-cast dataframe:

tracemalloc.start()

df.iloc[:, 1000:]

print(f"MB peak of RAM: {tracemalloc.get_traced_memory()[1] / 1024 / 1024}")

tracemalloc.stop()

only 0.023 MB of RAM is used.
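Note that the measurements above cover only the selection itself; to actually release the memory held by the discarded columns, rebind the name to a copy of the selection so the original buffers become unreferenced (a sketch, assuming no other references to the original frame, on an illustrative smaller frame):

```python
import gc
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.ones((1000, 1000))).astype('int8')

# Rebind df to a materialised copy of the slice; the original
# 1000x1000 block then has no remaining references
df = df.iloc[:, 100:].copy()
gc.collect()  # allow the interpreter to reclaim the old buffers

print(df.shape)  # (1000, 900)
```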


In general, vectorised operations are more performant (often by orders of magnitude) than convenience functions with heavy internal machinery, especially in pandas.
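If the columns to keep are known by name rather than by position, plain label-based selection expresses the same operation as the iloc example; a minimal sketch (the column names here are just the default integer labels, and whether the result shares memory with the original depends on pandas internals):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.ones((1000, 1000))).astype('int8')

# Select the columns to keep by label instead of dropping the rest
keep = df.columns[100:]
result = df[keep]

print(result.shape)  # (1000, 900)
```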

Giuseppe La Gualano