Using .loc on row names vs using searchsorted

Question

I have a pandas data frame where I have a sorted column id. I need to reference rows in the data frame by id many times in my code. Is it faster to do it via pd.searchsorted, or by df['id']==value, or by making the id column the key via df = df.set_index('id') and then slicing it by df.loc[[value],:]? Or is there no difference because they all have the same implementation under the hood? For additional info, the data frames are quite large: on the order of 1e7 rows.

cs95 · Accepted Answer · 2023-04-20T16:15:24.583

Consider two scenarios:

the id you're searching for exists
the id you're searching for does not exist

In case 1), both np.searchsorted(df['id'], id) and df.index[df['id'] == id] return the same result. Except that, when the "id" column is sorted, np.searchsorted the answer can be retrieved in O(log N) time. This is because searchsorted uses binary search to traverse the array, as opposed to the equality comparison which returns a result in O(N) time. But this only works if "id" is sorted, otherwise searchsorted behavior is incorrect.

With case 2), it gets a little more tricky. If the id doesn't exist, searchsorted STILL returns a result. It returns the position at which that "id" would've existed in the sorted array had it been present. So the responsibility is on you to double check that the id at that position matches the id you're looking for. With an equality comparison, if the id doesn't exist, the index is falsy.

if you want to speed up lookup, you can set "id" to the DataFrame index under the following conditions:

id is sorted (otherwise you run into performance warnings)
id is unique (although not critical to this working)
you need to do enough number of lookups to justify setting it as the index

Under these conditions, "id" will be hashed and lookups can be constant time O(1) operations that can be done with df.loc[key] or key in df.index.

As for performance, you can easily verify the performance using %timeit.

Timing these two methods is the easiest way to tell.

Setup:

import pandas as pd
import numpy as np

np.random.seed(0)

n = 100000
ids = np.random.randint(0, 1000000, size=n)
values = np.random.randn(n)

df = pd.DataFrame({'id': np.sort(ids), 'value': values})

Verify correctness:

# pick an id that exists
key = 498044
id = np.searchsorted(df['id'], key)
df.iloc[id]
id       498044.000000
value         0.469857
Name: 50000, dtype: float64

id = df.index[df['id'] == key][0]
df.loc[id]
id       498044.000000
value         0.469857
Name: 50000, dtype: float64

Now, we time these operations:

%%timeit 
key = 498044
id = np.searchsorted(df['id'], key)
df.iloc[id]

%%timeit 
id = df.index[df['id'] == key][0]
df.loc[id]

Both return

30.5 µs ± 635 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) # 4X faster
121 µs ± 652 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Thanks, I understand the behavior of searchsorted. I was more asking along the lines of whether making the `id` into a row key actually helps (maybe because of some hashing in the background) or if there's fundamentally no difference to what's happening under the hood. — Kevin, Apr 20 '23 at 16:13
@Kevin so for that to work - 1) id needs to be unique, and 2) id needs to also be sorted. Then you can set id to be the index and once that works, things like `df.loc[key]` or `key in df.index` will be O(1) operations. Updated my answer. — cs95, Apr 20 '23 at 16:16
Thanks, it satisfies all those conditions, and just to confirm, setting them as index means doing the operation `df = df.set_index('id')`? — Kevin, Apr 20 '23 at 17:36
@Kevin correctomundo. And FYI if id was not sorted, you could have done `df = df.set_index('id').sort_index()` — cs95, Apr 20 '23 at 17:37

Using .loc on row names vs using searchsorted

1 Answers1