
I timed two Pandas queries in the hope of achieving much higher speeds with an index. However, the opposite happened. Can someone explain why that is, or whether I am doing something wrong? My understanding was that a Pandas index works as a hash table, so lookups would happen in constant time. As far as row filtering is concerned, I believe it is sequential: each time a filter is applied, all the rows in the data frame are scanned.

The data set has about 8 million rows and 7 columns. I am trying to filter by a combination of string values in a column whose values are not unique.

In [1]: import pandas as pd

In [2]: df = pd.read_csv("/path/to/file", header=None, sep='\t', usecols=[0,1,2,3,5,6,7], names=['A', 'B', 'C', 'D', 'E', 'F', 'G'])

In [3]: %timeit -n10 df[df['B'].isin(['S1', 'S2'])]
10 loops, best of 3: 145 ms per loop

In [4]: df.dtypes
Out[4]: 
A     object
B     object
C      int64
D      int64
E    float64
F    float64
G     object
dtype: object

In [5]: df.shape
Out[5]: (8468828, 7)

After indexing:

In [4]: df2 = pd.read_csv("/path/to/file", header=None, sep='\t', usecols=[0,1,2,3,5,6,7], names=['A', 'B', 'C', 'D', 'E', 'F', 'G'])

In [5]: df2.set_index('B', inplace=True)

In [6]: %timeit -n10 df2.loc[['S1', 'S2']]
10 loops, best of 3: 302 ms per loop

2 Answers


@HYRY's explanation of how indices are treated in pandas is informative:

When the index is unique, pandas uses a hashtable to map key to value, O(1). When the index is non-unique but sorted, pandas uses binary search, O(log N). When the index is randomly ordered, pandas needs to check all the keys in the index, O(N).
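
You can check which of those regimes your index falls into. Here is a minimal sketch (using a small synthetic frame in place of the original file, so the data here is illustrative):

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: non-unique string keys in 'B'
df = pd.DataFrame({'B': np.random.choice(['S1', 'S2', 'S3', 'S4'], size=1_000_000),
                   'C': np.arange(1_000_000)})
df2 = df.set_index('B')

df2.index.is_unique                 # False -> no O(1) hashtable path
df2.index.is_monotonic_increasing   # False -> no O(log N) binary search either

df2s = df2.sort_index()             # sorting enables the O(log N) path
df2s.index.is_monotonic_increasing  # True
# %timeit df2s.loc[['S1', 'S2']]    # compare against the boolean-mask filter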

  • Thanks. I checked with a sorted index as well; in fact, it was slower than the unsorted one. I was also shooting for O(log N) speed, but repeatedly got better results with sequential column filtering (I think, as @Alex pointed out, due to the Cython speeds). – c00der Feb 07 '18 at 20:17

.loc is implemented in Python, so it's slow.

The first approach does two things:

  1. Compute `.isin`. Here's a link to the path that your code takes. It relies on the hashtable module, which is written in Cython (and runs at near-C speeds).
  2. Once you've computed a mask, applying it is mostly done with NumPy, which again means C speeds.

The moral of the story is that staying in C/Cython land is faster than operating in Python land.
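
To make those two steps concrete, here is a minimal sketch of the same filter written out explicitly (small synthetic frame, illustrative names):

import pandas as pd

df = pd.DataFrame({'B': ['S1', 'S3', 'S2', 'S4'] * 3, 'C': range(12)})

mask = df['B'].isin(['S1', 'S2'])  # step 1: Cython hashtable membership test
result = df[mask]                  # step 2: NumPy boolean-mask selection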

  • Hi, thanks. I am finding it hard to understand why Pandas indexes exist if they are less efficient than the NumPy/Cython-based column filtering. Is there some other way to get better performance using indexes? – c00der Feb 07 '18 at 18:11
  • Because they are super convenient. Check out [the docs on indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html) to get a feel for what you can do with an `Index`. Secondly, better performance is always machine- and dataset-specific. I'd recommend profiling your code and then going after bottlenecks. There are line profilers (e.g. [kernprof](https://github.com/rkern/line_profiler)) and function profilers (e.g. Python's [profile module](https://docs.python.org/3/library/profile.html#module-profile)) that make this pretty straightforward. – Alex Feb 07 '18 at 18:19