I timed two Pandas queries, hoping that an index would make lookups much faster. However, the opposite happened. Can someone explain why, or whether I am doing something wrong? My understanding was that a Pandas index works like a hash table, so lookups should happen in constant time. Row filtering, on the other hand, I believe is sequential: each time a filter is applied, every row in the data frame is scanned.
The data set has about 8 million rows and 7 columns. I am trying to filter by a combination of string values in a column whose values are not unique.
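To be concrete about what I mean by constant-time lookup: a toy sketch (made-up labels, not my real data) of the hash-based position lookup I assumed an index would give me:

```python
import pandas as pd

# A small unique index; get_loc resolves a label to its position,
# which I assumed is backed by a hash table internally
idx = pd.Index(['S1', 'S2', 'S3'])
print(idx.get_loc('S2'))  # -> 1
```

This is what led me to expect that `.loc` on an indexed frame should beat a full-column scan.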
In [1]: import pandas as pd
In [2]: df = pd.read_csv("/path/to/file", header=None, sep='\t', usecols=[0,1,2,3,5,6,7], names=['A', 'B', 'C', 'D', 'E', 'F', 'G'])
In [3]: %timeit -n10 df[df['B'].isin(['S1', 'S2'])]
10 loops, best of 3: 145 ms per loop
In [4]: df.dtypes
Out[4]:
A           object
B           object
C            int64
D            int64
E          float64
F          float64
G           object
dtype: object
In [5]: df.shape
Out[5]: (8468828, 7)
After indexing:
In [4]: df2 = pd.read_csv("/path/to/file", header=None, sep='\t', usecols=[0,1,2,3,5,6,7], names=['A', 'B', 'C', 'D', 'E', 'F', 'G'])
In [5]: df2.set_index('B', inplace=True)
In [6]: %timeit -n10 df2.loc[['S1', 'S2']]
10 loops, best of 3: 302 ms per loop
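Since I can't share the real file, here is a self-contained toy version of the two approaches with synthetic data (column names match my setup above; the string labels and row count are invented), in case it helps anyone reproduce the comparison:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real 8M-row file
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'B': rng.choice([f'S{i}' for i in range(100)], size=n),
    'C': rng.integers(0, 10, size=n),
})

# Approach 1: boolean mask over the column
m1 = df[df['B'].isin(['S1', 'S2'])]

# Approach 2: label lookup on a (non-unique) index
df2 = df.set_index('B')
m2 = df2.loc[['S1', 'S2']]

# Both select the same rows; only the ordering differs,
# since .loc groups the result by label
assert len(m1) == len(m2)
```

On my machine the `.loc` version is the slower of the two, which is the behaviour I am asking about.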