
I have a 30-million-row by 30-column dataframe that I want to filter using a list of unique indices.

Basically the input would be:

import pandas as pd

df = pd.DataFrame({'column': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
indices = [1, 7, 10]

df_filtered = df[df.index.isin(indices)]

With the output being:

df_filtered

    column
1        1
7        7
10      10

This works well with 'manageable' dataframes, but when I try to match a (30,000,000, 30) dataframe against a list of ~33,000 unique indices, I run into a MemoryError on my local machine.

Is there a way I can parallelize this process or break it into pieces more efficiently?

1 Answer


The actual answer depends on what you want to do with the DataFrame, but a general idea when running into memory errors is to do the operation in chunks.

In your case, a chunk of size N is a slice of N consecutive elements from the indices list:

import pandas as pd

df = pd.DataFrame()  # placeholder for your huge dataframe
indices = []  # placeholder for your long list of indices

chunksize = 50  # number of indices to process per chunk

for start in range(0, len(indices), chunksize):
    current_indices = indices[start:start + chunksize]
    # filter the big dataframe against this small batch of indices
    df_filtered = df[df.index.isin(current_indices)]
    # do what you want with df_filtered here
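
If the end goal is simply one DataFrame holding all matching rows, a minimal sketch extending the loop above (assuming the filtered result itself fits in memory, which ~33,000 rows should) is to collect each chunk's result and concatenate at the end:

filtered_chunks = []  # accumulate the result of each chunk here

for start in range(0, len(indices), chunksize):
    current_indices = indices[start:start + chunksize]
    # each batch of indices matches at most len(current_indices) rows
    filtered_chunks.append(df[df.index.isin(current_indices)])

# ~33,000 matching rows is small, so concatenating them is cheap
df_filtered = pd.concat(filtered_chunks)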