
I have a 30-million-row by 30-column dataframe that I want to filter using a list of unique indices.

Basically the input would be:

import pandas as pd

df = pd.DataFrame({'column': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
indices = [1, 7, 10]

df_filtered = df[df.index.isin(indices)]

With the output being:

df_filtered

    column
1        1
7        7
10      10

This works well with 'manageable' dataframes, but when I try to match a (30,000,000, 30) dataframe against a list of ~33,000 unique indices, I run into a MemoryError on my local machine.

Is there a way I can parallelize this process or break it into pieces more efficiently?

1 Answer


The actual answer depends on what you want to do with the DataFrame, but a general idea when running into memory errors is to do the operation in chunks.

In your case, a chunk of size N is a slice of N consecutive elements from the indices list:

import pandas as pd

df = pd.DataFrame()  # placeholder for your huge dataframe
indices = []  # placeholder for your long list of indices

chunksize = 50  # number of indices to process per chunk

for start in range(0, len(indices), chunksize):
    current_indices = indices[start:start + chunksize]
    # filter the big dataframe against this small batch of indices
    df_filtered = df[df.index.isin(current_indices)]
    # do what you want with df_filtered here
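
If the end goal is simply one DataFrame holding all matching rows, a minimal sketch extending the loop above (assuming the filtered result itself fits in memory, which ~33,000 rows should) is to collect each chunk's result and concatenate at the end:

filtered_chunks = []  # accumulate the result of each chunk here

for start in range(0, len(indices), chunksize):
    current_indices = indices[start:start + chunksize]
    # each batch of indices matches at most len(current_indices) rows
    filtered_chunks.append(df[df.index.isin(current_indices)])

# ~33,000 matching rows is small, so concatenating them is cheap
df_filtered = pd.concat(filtered_chunks)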