5

I have a big numeric Pandas DataFrame df, and I want to select the rows whose value in a certain column falls within the range min_value to max_value.

I can do this by:

filtered_df = df[(df[col_name].values >= min_value) & (df[col_name].values <= max_value)]

I am looking for ways to speed this up. I tried the following:

import numpy as np

# Sort by the column, then binary-search for the range boundaries
df.sort(col_name, inplace=True)  # pandas 0.11 API; sort_values() in later versions
left_idx = np.searchsorted(df[col_name].values, min_value, side='left')
right_idx = np.searchsorted(df[col_name].values, max_value, side='right')
filtered_df = df[left_idx:right_idx]

But this does not help, because df.sort() itself costs more time than the boolean selection saves.

So, any tips to speed up the selection?

(Pandas 0.11)

bigbug
  • Your initial query could be rewritten as `filtered_df = df[(df[col_name] >= min_value) & (df[col_name] <= max_value)]` and I think this is the correct way of doing it. It should work in linear (O(N)) time, versus sort, which (generally) works in O(N*logN). OTOH, if you are doing such selections multiple times, you can sort just once and then use the sorted DF to make selections (see the sketch after these comments). – Roman Pekar Nov 26 '13 at 05:28
    @RomanPekar: I think going via `.values` will be faster as there's no index overhead. – DSM Nov 26 '13 at 05:30
  • @DSM thanks, that's why I didn't add an answer :) Still learning things about pandas / numpy. Do you think there's a way to speed up OP task? – Roman Pekar Nov 26 '13 at 05:33
  • I was hoping the new `.query` functionality would, but it doesn't seem to. Hopefully Jeff or Andy can suggest something. – DSM Nov 26 '13 at 05:36
  • @DSM In messing around I haven't got query working faster yet either, but I think it ought to (I thought this was a primary use case for it). +1 on the .values being faster... How "big" are we talking here? – Andy Hayden Nov 26 '13 at 05:42
  • @AndyHayden, the multilevel df I am dealing with is around (10000x50)x500; as I am working on a somewhat outdated machine, the performance bottleneck is easy to notice. – bigbug Nov 26 '13 at 06:15
  • @bigbug can you post what the index looks like? and ``df.info()``? do you have a lot of memory on this machine? – Jeff Nov 26 '13 at 11:34
  • @Jeff, pls refer to this http://stackoverflow.com/questions/17390886/how-to-speed-up-pandas-multilevel-dataframe-sum – bigbug Nov 26 '13 at 13:29
  • ok.....I think this is about as good as you can get actually. The bottleneck is not the inner computation, but the indexing (e.g. you have a boolean array and are getting the resulting frame). 0.13 will be a bit faster on this. (the other question is when you are using different frames, then the computation itself is the bottleneck). – Jeff Nov 26 '13 at 13:45
  • So any hope to speed up dataframe.sort()? That might avoid selection by boolean array. – bigbug Nov 26 '13 at 13:57
  • if you have a disjoint result at all then it still would result in a boolean array; and sort is O(n*log n), which is slower. Why is this the bottleneck for you? What are you doing with this result? – Jeff Nov 26 '13 at 14:26
  • Thanks for the answer. Not a serious bottleneck for me now, as the sorted dataframe can be repeatedly reused for many selections. – bigbug Nov 26 '13 at 14:41
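
Following Roman Pekar's suggestion above, one way to amortize the cost is to sort once and answer every later range query with np.searchsorted. A minimal sketch, assuming a modern pandas with sort_values (the toy frame and the select_range helper are illustrative, not from the original post):

import numpy as np
import pandas as pd

# Toy stand-in for the big frame from the question
df = pd.DataFrame({'col': np.random.normal(size=1000000)})
col_name, min_value, max_value = 'col', -1.0, 1.0

# Pay the O(N log N) sort cost once
sorted_df = df.sort_values(col_name)        # df.sort(col_name) on pandas 0.11
sorted_vals = sorted_df[col_name].values

def select_range(lo, hi):
    # Each query is then an O(log N) binary search plus a cheap positional slice
    left = np.searchsorted(sorted_vals, lo, side='left')
    right = np.searchsorted(sorted_vals, hi, side='right')
    return sorted_df.iloc[left:right]

filtered_df = select_range(min_value, max_value)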

1 Answer

6

I think your best bet is to use numexpr to speed this up:

import pandas as pd
import numpy as np
import numexpr as ne

# 100 million random values as a stand-in for the big frame
data = np.random.normal(size=100000000)
df = pd.DataFrame(data=data, columns=['col'])
a = df['col']
min_val = a.min()
max_val = a.max()
# numexpr picks up a, min_val and max_val from the local scope
expr = '(a >= min_val) & (a <= max_val)'

And the timings ...

%timeit eval(expr)
1 loops, best of 3: 668 ms per loop

%timeit ne.evaluate(expr)
1 loops, best of 3: 197 ms per loop
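
The expression only produces a boolean mask; to actually obtain the filtered frame, the array returned by ne.evaluate can be used to index df directly (a small follow-on sketch, not part of the original timings):

mask = ne.evaluate(expr)   # plain NumPy boolean array, one entry per row
filtered_df = df[mask]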
JaminSore