Efficient masking on sorted dataframe column

Question

Say I have a dataframe animals with a column called weight. I want to perform several filtering operations of the form

mask = animals['weight'] == 123
animals_filtered = animals[mask]

or

mask = animals['weight'] <= 123
animals_filtered = animals[mask]

as efficiently as possible.

Would it help me in any way to preprocess my dataframe by sorting it according to the values in the weight column?

animals.sort_values(by='weight', inplace=True)

If yes, how can I take advantage of the fact that my column is sorted afterwards when performing the filtering operations?

Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. — itprorh66, Aug 19 '23 at 19:13
I don't think the sort would make the mask operation faster but you can use a binary search after the sort to find your data as per this post: https://stackoverflow.com/questions/59490045/how-do-i-binary-search-a-pandas-dataframe-for-a-combination-of-column-values — SR3142, Aug 19 '23 at 19:42
The whole operation would then be greatly dominated by the time taken just for the sort which is much more computationally intensive than filtering. With this kind of simple DF filtering on a numerical column you will typically process say 10M rows in a few milli-seconds - are you sure you really have an efficiency issue? — user19077881, Aug 19 '23 at 20:57
Can you define exactly "*several filtering operations*"? Is it just the two of the example? 10? 1000? How big is your input DataFrame? Sorting is relatively expensive (`O(n*log(n))`) so it's really inefficient compared to two simple filters. If you want to compute 1000 filters then you have to define exactly what. Is it even useful to generate so many DataFrames? What are you doing with those? — mozway, Aug 19 '23 at 21:50

score 1 · Answer 1 · answered Aug 20 '23 at 20:09

1

You may want to do the filtering in the following way:

mask = animals['weight'].values == 123
mask2 = animals['height'].values == 12
animals_filtered = animals[mask & mask2]

This approach is suggested by the extensive experiments from here.

answered Aug 20 '23 at 20:09

sophros

1 Answers1