I have a 30 million row by 30 column dataframe that I want to filter using a list of unique indices.
Basically the input would be:
import pandas as pd

df = pd.DataFrame({'column': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
indices = [1, 7, 10]
df_filtered = df[df.index.isin(indices)]
With the output being:
df_filtered
    column
1        1
7        7
10      10
This works well with 'manageable' dataframes, but when I try to match a (30,000,000, 30) dataframe against a list of ~33,000 unique indices, I run into a MemoryError.
Is there a way I can parallelize this process or break it into pieces more efficiently?
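For example, I imagine something like this rough sketch, where the dataframe is scanned in row slices so the boolean mask and the filtered copy never cover all 30 million rows at once (the chunk_size value and the filter_in_chunks helper are just my guesses, not tested code):

import pandas as pd

def filter_in_chunks(df, indices, chunk_size=1_000_000):
    # Membership tests against a set are O(1), and isin accepts one directly.
    index_set = set(indices)
    pieces = []
    # Slice by row position so only chunk_size rows are masked
    # and copied at any one time.
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        pieces.append(chunk[chunk.index.isin(index_set)])
    return pd.concat(pieces)

df_filtered = filter_in_chunks(df, indices)

Since each chunk is independent, I assume the loop body could also be handed off to something like multiprocessing.Pool.map, but I'm not sure whether that would actually help or just duplicate memory across the workers.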