Question:

Hi,

While searching for methods to select a subset of a dataframe (being relatively inexperienced with Pandas), I ran into the following question:

What is faster for large datasets: .isin() or .query()?

Query is somewhat more intuitive to read, so it is my preferred approach given my line of work. However, when testing it on a very small example dataset, query seems to be much slower.

Has anyone tested this properly before? If so, what were the outcomes? I searched the web but could not find another post on this.

See the sample code below, which works for Python 3.8.5.

Thanks a lot in advance for your help!

Code:
# Packages
import pandas as pd
import timeit
import numpy as np


# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                   'owner': ['Canyon', 'Endurace', 'Bike']},
                  index=['Frame', 'Type', 'Kind'])

# Show dataframe
df

# Create filter
selection = ['Canyon']

# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)] 

%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# Filter dataframe using 'isin' (type 2)
df_filtered = df[np.isin(df['owner'].values, selection)]

%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


# Filter dataframe using 'query'
df_filtered = df.query("owner in @selection")

%timeit df_filtered = df.query("owner in @selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Hedge92

2 Answers


The best test is on real data; here is a quick comparison for 3k, 300k and 3M rows, each built by repeating the original 3-row sample frame:

selection = ['Hedge']

df = pd.concat([df] * 1000, ignore_index=True)
In [139]: %timeit df[df['owner'].isin(selection)]
449 µs ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [140]: %timeit df.query("owner in @selection")
1.57 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

df = pd.concat([df] * 100000, ignore_index=True)
In [142]: %timeit df[df['owner'].isin(selection)]
8.25 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [143]: %timeit df.query("owner in @selection")
13 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

df = pd.concat([df] * 1000000, ignore_index=True)
In [145]: %timeit df[df['owner'].isin(selection)]
94.5 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [146]: %timeit df.query("owner in @selection")
112 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

If you check the docs:

DataFrame.query() using numexpr is slightly faster than Python for large frames
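
A side note on that quote: query only uses numexpr when the numexpr package is installed, and you can force either engine via the engine parameter to compare both yourself. A minimal sketch (note that pandas evaluates `in`/`not in` comparisons largely in plain Python even under the numexpr engine, so the engine choice may matter less here than for arithmetic expressions):

import pandas as pd

df = pd.DataFrame({'owner': ['Canyon', 'Endurace', 'Bike'] * 1_000_000})
selection = ['Canyon']

# explicitly request the numexpr engine (requires numexpr to be installed)
res_numexpr = df.query("owner in @selection", engine='numexpr')

# force the plain-Python engine for comparison
res_python = df.query("owner in @selection", engine='python')

assert res_numexpr.equals(res_python)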

Conclusion: the best test is on real data, because the outcome depends on the number of rows, the number of matched values, and also on the length of the selection list.

jezrael
  • Just a note. It also probably depends on the size of `selection`. What if the selection contains 100 or even 1k items? Then the faster approach will probably be `selection = set(items); df[df['owner'].apply(lambda x: x in selection)]` (see the sketch after this comment thread). – Alexander Volkovsky Jun 04 '21 at 10:19
  • @AlexanderVolkovsky - Ya, absolutely agree it depends on the size of `selection`, so the best test is on real data. Btw, `df['owner'].apply(lambda x: x in selection)` is slower, because it loops – jezrael Jun 04 '21 at 10:22
  • Because of the function calls, yes, it's definitely slower for small selection sizes. But the search complexity of a set is always O(1), so it can neutralize the cost of the function calls after a certain sample size – Alexander Volkovsky Jun 04 '21 at 10:26
  • @AlexanderVolkovsky - hmmm, is it possible to prove that with timings? – jezrael Jun 04 '21 at 10:27
  • Just to prove it: https://pastebin.com/am2zAHJf. On my laptop the second option works slightly faster than `.isin`. But there are a lot of nuances, of course. – Alexander Volkovsky Jun 04 '21 at 11:08
  • @jezrael and @AlexanderVolkovsky, thanks for your observations! It definitely makes sense that `query` is slower initially due to more function calls, but that the disadvantage shrinks with larger sample sizes. As I expect to work with relatively large sample sizes, I won't shy away from `query` - however, I will also compare performance with `isin` once performance becomes more important for my code. – Hedge92 Jun 04 '21 at 11:35
  • @Hedge92, I believe the fastest version will be `df[np.isin(df['owner'].values, selection)]`; pandas actually has a huge overhead. – Alexander Volkovsky Jun 04 '21 at 12:08
  • @AlexanderVolkovsky I will definitely try that solution later and get back to you on it! – Hedge92 Jun 04 '21 at 13:38
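
Not the pastebin code itself, but a minimal sketch of the set-based lookup discussed in this thread, in case you want to reproduce the comparison locally (the data and sizes here are made up, and timings will vary by machine):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 1M rows of string ids drawn from 10k distinct values, and a selection of 1k candidates
df = pd.DataFrame({'owner': rng.integers(0, 10_000, size=1_000_000).astype(str)})
selection = [str(i) for i in range(1_000)]
selection_set = set(selection)  # O(1) membership tests

# vectorised pandas approach
mask_isin = df['owner'].isin(selection)

# per-row Python function calls, but constant-time set lookups
mask_apply = df['owner'].apply(lambda x: x in selection_set)

# both produce the same boolean mask; compare them with %timeit on your machine
assert mask_isin.equals(mask_apply)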

A perfplot over some generated data:

[perfplot benchmark: runtime of test_isin vs test_query over frame size]

The plot assumes some hypothetical data, with a selection size that grows proportionally with the frame (10% of frame size).

Sample data for n=10:

df:

       name  owner
0  Constant  JoVMq
1  Constant  jiKNB
2  Constant  WEqhm
3  Constant  pXNqB
4  Constant  SnlbV
5  Constant  Euwsj
6  Constant  QPPbs
7  Constant  Nqofa
8  Constant  qeUKP
9  Constant  ZBFce

Selection:

['ZBFce']

Performance reflects the docs. At smaller frames, the overhead of `query` relative to `isin` is significant. However, at frames of around 200k rows the performance is comparable to `isin`, and at around 10M rows `query` starts to become more performant.

I agree with @jezrael that this is, as with most pandas runtime problems, very data dependent; the best test would be to benchmark on real datasets for a given use case and make a decision based on that.
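
For example, a minimal sketch of such a check on your own data, assuming your real dataframe is already loaded as df with the column of interest named owner, and your real list as selection (run in IPython/Jupyter, where %timeit is available):

import numpy as np

# time each candidate approach on the real data
%timeit df[df['owner'].isin(selection)]
%timeit df.query("owner in @selection")
%timeit df[np.isin(df['owner'].values, selection)]

# set-based variant from the comments on the other answer
selection_set = set(selection)
%timeit df[df['owner'].apply(lambda x: x in selection_set)]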


Edit: Included @AlexanderVolkovsky's suggestion to convert selection to a set and use apply + in:

[perfplot benchmark: runtime of test_isin, test_query and test_apply_over_set over frame size]


Perfplot Code:

import string

import numpy as np
import pandas as pd
import perfplot

charset = list(string.ascii_letters)

np.random.seed(5)


def gen_data(n):
    # build n rows: a constant 'name' column and random 5-letter 'owner' strings
    df = pd.DataFrame({'name': 'Constant',
                       'owner': [''.join(np.random.choice(charset, 5))
                                 for _ in range(n)]})
    # the selection is a random 10% of the owners, returned as a list and as a set
    selection = df['owner'].sample(frac=.1).tolist()
    return df, selection, set(selection)


def test_isin(params):
    df, selection, _ = params
    return df[df['owner'].isin(selection)]


def test_query(params):
    df, selection, _ = params
    return df.query("owner in @selection")


def test_apply_over_set(params):
    df, _, set_selection = params
    return df[df['owner'].apply(lambda x: x in set_selection)]


if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            test_isin,
            test_query,
            test_apply_over_set
        ],
        labels=[
            'test_isin',
            'test_query',
            'test_apply_over_set'
        ],
        n_range=[2 ** k for k in range(25)],  # frame sizes 1 .. ~16.8M rows
        equality_check=None  # kernels return filtered frames; skip output comparison
    )
    out.save('perfplot_results.png', transparent=False)
Henry Ecker
  • Thanks for your help! The visualization is a good way to support your point, thanks for that. Will consider both `query` and `isin`, but keeping in mind that `query` might outperform `isin` after a certain sample size. Also thanks for sharing a link to the documentation. – Hedge92 Jun 04 '21 at 11:38
  • @Hedge92, please include the comparison with `df[np.isin(df['owner'].values, selection)]` – Alexander Volkovsky Jun 04 '21 at 14:07