60

From the pandas documentation, I've gathered that unique-valued indices make certain operations efficient, and that non-unique indices are occasionally tolerated.

From the outside, it doesn't look like non-unique indices are taken advantage of in any way. For example, the following ix query is slow enough that it seems to be scanning the entire dataframe

In [23]: import numpy as np
In [24]: import pandas as pd
In [25]: x = np.random.randint(0, 10**7, 10**7)
In [26]: df1 = pd.DataFrame({'x':x})
In [27]: df2 = df1.set_index('x', drop=False)
In [28]: %timeit df2.ix[0]
1 loops, best of 3: 402 ms per loop
In [29]: %timeit df1.ix[0]
10000 loops, best of 3: 123 us per loop

(I realize the two ix queries don't return the same thing -- it's just an example that calls to ix on a non-unique index appear much slower)

Is there any way to coax pandas into using faster lookup methods like binary search on non-unique and/or sorted indices?

smci
  • 32,567
  • 20
  • 113
  • 146
ChrisB
  • 4,628
  • 7
  • 29
  • 41

2 Answers2

112

When index is unique, pandas use a hashtable to map key to value O(1). When index is non-unique and sorted, pandas use binary search O(logN), when index is random ordered pandas need to check all the keys in the index O(N).

You can call sort_index method:

import numpy as np
import pandas as pd
x = np.random.randint(0, 200, 10**6)
df1 = pd.DataFrame({'x':x})
df2 = df1.set_index('x', drop=False)
df3 = df2.sort_index()
%timeit df1.loc[100]
%timeit df2.loc[100]
%timeit df3.loc[100]

result:

10000 loops, best of 3: 71.2 µs per loop
10 loops, best of 3: 38.9 ms per loop
10000 loops, best of 3: 134 µs per loop
kmm
  • 6,045
  • 7
  • 43
  • 53
HYRY
  • 94,853
  • 25
  • 187
  • 187
  • 4
    I don't understand the timings at the end. df3 should be faster? – lucid_dreamer Aug 21 '18 at 08:44
  • 6
    @lucid_dreamer I was confused too, but df1 uses the default index which goes from 0 to len(df1) - 1 and is unique, so df1.loc[] uses a hashtable. df2 sets the index to 'x' which is not unique and not sorted, so it does a linear scan, O(N). df3 is the same as df2 but sorted and still non-unique, so it does a binary search. – Max Taggart Oct 17 '18 at 21:26
  • So why is the linear scan of df2 faster? – lucid_dreamer Oct 18 '18 at 09:55
  • I don't get why pandas switches to binary search here. For multimaps, indexing can still be done in O(1+R), instead of O(logN + R) (where R is the number of results returned. – user48956 Feb 03 '19 at 00:12
  • When the index is unique, does it matter whether it's also sorted? And does the answer to this depend on whether the index is a MultiIndex? – BallpointBen Feb 12 '19 at 14:21
  • Is this documented somewhere? – Mitar Dec 14 '20 at 17:28
  • 2
    This timing comparison is actually very misleading, as the first statement `df1.loc[100]` does something quite different than the other two, namely retrieve the 100th row using the implicitly created `RangeIndex`, while the other two retrieve all rows with x == 100. – Ernesto Elsäßer Oct 19 '21 at 13:53
  • 4
    @lucid_dreamer 4 years late but df2 is not faster, speed is measured in milliseconds. Whereas for df1 & df3 the speed is measured in microseconds, easy to miss. – s_wheels Mar 30 '22 at 12:43
  • 1
    Gosh, the world spins right again. Thank you @s_wheels. – lucid_dreamer May 12 '22 at 22:54
44

@HYRY said it well, but nothing says it quite like a colourful graph with timings.

enter image description here

Plots were generated using perfplot. Code, for your reference:

import pandas as pd
import perfplot

_rnd = np.random.RandomState(42)

def make_data(n):    
    x = _rnd.randint(0, 200, n)
    df1 = pd.DataFrame({'x':x})
    df2 = df1.set_index('x', drop=False)
    df3 = df2.sort_index()

    return df1, df2, df3

perfplot.show(
    setup=lambda n: make_data(n),
    kernels=[
        lambda dfs: dfs[0].loc[100],
        lambda dfs: dfs[1].loc[100],        
        lambda dfs: dfs[2].loc[100],
    ],
    labels=['Unique index', 'Non-unique, unsorted index', 'Non-unique, sorted index'],
    n_range=[2 ** k for k in range(8, 23)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=False)
cs95
  • 379,657
  • 97
  • 704
  • 746
  • 1
    I'm not seeing where you actually time the operations and am having trouble with timing pandas operations in general. – young_souvlaki Oct 25 '21 at 18:19
  • @young_souvlaki I don't understand, the code is inlined in the answer under the graph, and you will need to install the perfplot library. For the actual methods being tested, check the make_data functions, then check the `kernels` arg to `perfplot.show` – cs95 Oct 26 '21 at 04:34
  • 1
    Ah, `perfplot` is doing the timing. – young_souvlaki Oct 26 '21 at 22:19