I have a pandas data frame where I have a sorted column id
. I need to reference rows in the data frame by id
many times in my code. Is it faster to do it via pd.searchsorted
, or by df['id']==value
, or by making the id
column the key via df = df.set_index('id')
and then slicing it by df.loc[[value],:]
? Or is there no difference because they all have the same implementation under the hood? For additional info, the data frames are quite large: on the order of 1e7 rows.

- 149
- 6
1 Answers
Consider two scenarios:
- the id you're searching for exists
- the id you're searching for does not exist
In case 1), both np.searchsorted(df['id'], id)
and df.index[df['id'] == id]
return the same result. Except that, when the "id" column is sorted, np.searchsorted
the answer can be retrieved in O(log N) time. This is because searchsorted
uses binary search to traverse the array, as opposed to the equality comparison which returns a result in O(N) time. But this only works if "id" is sorted, otherwise searchsorted
behavior is incorrect.
With case 2), it gets a little more tricky. If the id
doesn't exist, searchsorted
STILL returns a result. It returns the position at which that "id" would've existed in the sorted array had it been present. So the responsibility is on you to double check that the id at that position matches the id you're looking for. With an equality comparison, if the id doesn't exist, the index is falsy.
if you want to speed up lookup, you can set "id" to the DataFrame index under the following conditions:
- id is sorted (otherwise you run into performance warnings)
- id is unique (although not critical to this working)
- you need to do enough number of lookups to justify setting it as the index
Under these conditions, "id" will be hashed and lookups can be constant time O(1) operations that can be done with df.loc[key]
or key in df.index
.
As for performance, you can easily verify the performance using %timeit.
Timing these two methods is the easiest way to tell.
Setup:
import pandas as pd
import numpy as np
np.random.seed(0)
n = 100000
ids = np.random.randint(0, 1000000, size=n)
values = np.random.randn(n)
df = pd.DataFrame({'id': np.sort(ids), 'value': values})
Verify correctness:
# pick an id that exists
key = 498044
id = np.searchsorted(df['id'], key)
df.iloc[id]
id 498044.000000
value 0.469857
Name: 50000, dtype: float64
id = df.index[df['id'] == key][0]
df.loc[id]
id 498044.000000
value 0.469857
Name: 50000, dtype: float64
Now, we time these operations:
%%timeit
key = 498044
id = np.searchsorted(df['id'], key)
df.iloc[id]
%%timeit
id = df.index[df['id'] == key][0]
df.loc[id]
Both return
30.5 µs ± 635 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) # 4X faster
121 µs ± 652 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

- 379,657
- 97
- 704
- 746
-
Thanks, I understand the behavior of searchsorted. I was more asking along the lines of whether making the `id` into a row key actually helps (maybe because of some hashing in the background) or if there's fundamentally no difference to what's happening under the hood. – Kevin Apr 20 '23 at 16:13
-
1@Kevin so for that to work - 1) id needs to be unique, and 2) id needs to also be sorted. Then you can set id to be the index and once that works, things like `df.loc[key]` or `key in df.index` will be O(1) operations. Updated my answer. – cs95 Apr 20 '23 at 16:16
-
Thanks, it satisfies all those conditions, and just to confirm, setting them as index means doing the operation `df = df.set_index('id')`? – Kevin Apr 20 '23 at 17:36
-
1@Kevin correctomundo. And FYI if id was not sorted, you could have done `df = df.set_index('id').sort_index()` – cs95 Apr 20 '23 at 17:37