
I timed two Pandas queries in the hope of achieving much higher speeds with an index. However, the opposite happened. Can someone explain why that is, or whether I am doing something wrong? My understanding was that a Pandas index works as a hash table, so lookups would happen in constant time. As far as row filtering is concerned, I believe it is sequential: each time a filter is applied, all the rows in the data frame are scanned.

The data set has about 8 million rows and 7 columns. I am trying to filter by a combination of string values in a column whose values are not unique.

In [1]: import pandas as pd

In [2]: df = pd.read_csv("/path/to/file", header=None, sep='\t', usecols=[0,1,2,3,5,6,7], names=['A', 'B', 'C', 'D', 'E', 'F', 'G'])

In [3]: %timeit -n10 df[df['B'].isin(['S1', 'S2'])]
10 loops, best of 3: 145 ms per loop

In [4]: df.dtypes
Out[4]: 
A     object
B     object
C      int64
D      int64
E    float64
F    float64
G     object
dtype: object

In [5]: df.shape
Out[5]: (8468828, 7)

After indexing:

In [4]: df2 = pd.read_csv("/path/to/file", header=None, sep='\t', usecols=[0,1,2,3,5,6,7], names=['A', 'B', 'C', 'D', 'E', 'F', 'G'])

In [5]: df2.set_index('B', inplace=True)

In [6]: %timeit -n10 df2.loc[['S1', 'S2']]
10 loops, best of 3: 302 ms per loop

2 Answers


@HYRY's explanation of how indices are treated in pandas is informative:

When the index is unique, pandas uses a hashtable to map key to value, O(1). When the index is non-unique but sorted, pandas uses binary search, O(log N). When the index is randomly ordered, pandas needs to check all the keys in the index, O(N).
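
You can check which of those regimes your index falls into. Here is a minimal sketch (using a small synthetic frame in place of the original file, so the data here is illustrative):

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: non-unique string keys in 'B'
df = pd.DataFrame({'B': np.random.choice(['S1', 'S2', 'S3', 'S4'], size=1_000_000),
                   'C': np.arange(1_000_000)})
df2 = df.set_index('B')

df2.index.is_unique                 # False -> no O(1) hashtable path
df2.index.is_monotonic_increasing   # False -> no O(log N) binary search either

df2s = df2.sort_index()             # sorting enables the O(log N) path
df2s.index.is_monotonic_increasing  # True
# %timeit df2s.loc[['S1', 'S2']]    # compare against the boolean-mask filter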

  • Thanks. I checked with a sorted index as well; in fact, it was slower than the unsorted one. I was also shooting for O(log N) speed, but repeatedly got better results with sequential column filtering (I think, as @Alex pointed out, due to the Cython speeds). – c00der Feb 07 '18 at 20:17

.loc is implemented in Python, so it's slow.

The first approach does two things:

  1. Compute `.isin`. Here's a link to the path that your code takes. It relies on the hashtable module, which is written in Cython (and runs at near-C speeds).
  2. Once you've computed a mask, applying it is mostly done with NumPy, which again means C speeds.

The moral of the story is that staying in C/Cython land is faster than operating in Python land.
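
To make those two steps concrete, here is a minimal sketch of the same filter written out explicitly (small synthetic frame, illustrative names):

import pandas as pd

df = pd.DataFrame({'B': ['S1', 'S3', 'S2', 'S4'] * 3, 'C': range(12)})

mask = df['B'].isin(['S1', 'S2'])  # step 1: Cython hashtable membership test
result = df[mask]                  # step 2: NumPy boolean-mask selection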

  • Hi, thanks. I am finding it hard to understand why Pandas indexes exist if they are less efficient than the NumPy/Cython-based column filtering. Is there some other way to get better performance using indexes? – c00der Feb 07 '18 at 18:11
  • Because they are super convenient. Check out [the docs on indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html) to get a feel for what you can do with an `Index`. Secondly, better performance is always machine- and dataset-specific. I'd recommend profiling your code and then going after bottlenecks. There are line profilers (e.g. [kernprof](https://github.com/rkern/line_profiler)) and function profilers (e.g. Python's [profile module](https://docs.python.org/3/library/profile.html#module-profile)) that make this pretty straightforward. – Alex Feb 07 '18 at 18:19