
I often need to filter a pandas dataframe df with df[df['col_name']=='string_value'], and I want to speed up this row selection operation. Is there a quick way to do that?

For example,

In [1]: df = mul_df(3000,2000,3).reset_index()

In [2]: timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 1.52 s per loop

Can the 1.52 s be shortened?

Note:

mul_df() is a function that creates a multilevel dataframe:

>>> mul_df(4,2,3)
                 COL000  COL001  COL002
STK_ID RPT_Date                        
A0000  B000      0.6399  0.0062  1.0022
       B001     -0.2881 -2.0604  1.2481
A0001  B000      0.7070 -0.9539 -0.5268
       B001      0.8860 -0.5367 -2.4492
A0002  B000     -2.4738  0.9529 -0.9789
       B001      0.1392 -1.0931 -0.2077
A0003  B000     -1.1377  0.5455 -0.2290
       B001      1.0083  0.2746 -0.3934

Below is the code of mul_df():

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    '''Create a multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID', 'RPT_Date']
    col_name = ['COL' + str(x).zfill(3) for x in range(col_num)]

    # build the two index levels: each first-level label repeats level2_rownum times
    first_level_dt = [['A' + str(x).zfill(4)] * level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt))  # flatten the nested list
    second_level_dt = ['B' + str(x).zfill(3) for x in range(level2_rownum)] * level1_rownum

    # random data, then attach the two label columns and promote them to a MultiIndex
    dt = pd.DataFrame(np.random.randn(level1_rownum * level2_rownum, col_num),
                      columns=col_name, dtype=data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst
bigbug
  • Why not keep it as an index (or, if it is not an index, set it as one)? Selecting on the index is much faster (`df.ix['A0003']`): for me 194 us vs 646 ms. – joris May 05 '13 at 12:53
  • That is just a sample. Many times we need to select by column instead of by index; setting it as an index is an option, but it introduces other problems (an index already exists, it changes the dataframe structure, etc.). If we treat the dataframe as a database table, it is very common to write SELECT * FROM table_name WHERE col_name='string_value', isn't it? – bigbug May 05 '13 at 13:05
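A minimal sketch of the index-based selection suggested in the first comment above, using the .loc accessor from current pandas rather than the deprecated .ix, and assuming the mul_df() helper defined above:

df = mul_df(3000, 2000, 3).reset_index()

# one-time cost: move the lookup column into the (sorted) index
indexed = df.set_index('STK_ID').sort_index()

# repeated lookups become index lookups instead of full boolean scans
rows = indexed.loc['A0003']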

3 Answers


I have long wanted to add binary search indexes to DataFrame objects. You can take the DIY approach of sorting by the column and doing this yourself:

In [11]: df = df.sort('STK_ID') # skip this if you're sure it's sorted

In [12]: df['STK_ID'].searchsorted('A0003', 'left')
Out[12]: 6000

In [13]: df['STK_ID'].searchsorted('A0003', 'right')
Out[13]: 8000

In [14]: timeit df[6000:8000]
10000 loops, best of 3: 134 µs per loop

This is fast because it always retrieves views and does not copy any data.
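A minimal sketch of the same idea wrapped in a helper for current pandas, where sort() has been replaced by sort_values(); the name select_sorted is only illustrative, not part of pandas:

def select_sorted(df, col, value):
    '''Return the rows where df[col] == value, assuming df is already
    sorted by col, using binary search instead of a full column scan.'''
    left = df[col].searchsorted(value, side='left')
    right = df[col].searchsorted(value, side='right')
    return df.iloc[left:right]

df = df.sort_values('STK_ID')                # one-time sort
rows = select_sorted(df, 'STK_ID', 'A0003')  # two binary searches plus a positional slice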

Wes McKinney
  • Intuitive solution, thanks. Hopefully it can be embedded in pandas row selection syntax such as "df[df['col_name']=='string_value']" – bigbug May 07 '13 at 01:56
  • The `sort` method is deprecated in pandas 0.19.2 and up. Use the `sort_values` method instead. – Jeff Nov 13 '17 at 19:40
  • This answer is severely underrated. Several similar answers on SO suggest using in1d or the OP's original method, but this saved around 23 seconds over in1d when creating over 1200 snapshots of data from a 1M+ row dataframe (in1d takes ~24.5 seconds, this method takes ~1.5 seconds). – heretomurimudamura Jan 16 '21 at 09:55
  • This is an awesome answer! In my particular case I switched from np.where, which according to https://stackoverflow.com/a/44931669 was the fastest method, to this approach, and the overall time went down from about 11000 s to 2000 s. On my 40M+ rows it added a one-time sorting cost of 45 s. – VMQ May 21 '22 at 22:12

Somewhat surprisingly, working with the .values array instead of the Series is much faster for me:

>>> time df = mul_df(3000, 2000, 3).reset_index()
CPU times: user 5.96 s, sys: 0.81 s, total: 6.78 s
Wall time: 6.78 s
>>> timeit df[df["STK_ID"] == "A0003"]
1 loops, best of 3: 841 ms per loop
>>> timeit df[df["STK_ID"].values == "A0003"]
1 loops, best of 3: 210 ms per loop
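The gain comes from building the boolean mask on the raw NumPy array and skipping the Series machinery; a minimal sketch of the same selection spelled with .to_numpy(), the newer recommended equivalent of .values:

mask = df['STK_ID'].to_numpy() == 'A0003'  # plain NumPy boolean array
rows = df[mask]                            # only the final selection goes through pandas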
DSM

Depending on what you want to do with the selection afterwards, and whether you have to make multiple selections of this kind, the groupby functionality can also make things faster (at least with this example).

Even if you only have to select the rows for one string_value, it is a little bit faster (but not much):

In [11]: %timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 626 ms per loop

In [12]: %timeit df.groupby("STK_ID").get_group("A0003")
1 loops, best of 3: 459 ms per loop

But subsequent calls to the GroupBy object will be very fast (e.g. to select the rows of other string_values):

In [25]: grouped = df.groupby("STK_ID")

In [26]: %timeit grouped.get_group("A0003")
1 loops, best of 3: 333 us per loop
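A short sketch of the pattern described above when several values have to be looked up: pay the grouping cost once, then call get_group() per value (the stk_ids list is just an illustrative placeholder):

grouped = df.groupby('STK_ID')            # one-time grouping cost

stk_ids = ['A0003', 'A0010', 'A0042']     # hypothetical keys to fetch
subsets = {stk: grouped.get_group(stk) for stk in stk_ids}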
joris