1

I would like to have a quick index lookup when using a pandas dataframe. As noted here and there, I understand that I need to keep my index unique, otherwise all hope is lost.

I made sure that the index is sorted and unique:

df = df.sort_index()
assert df.index.is_unique

I measured the lookup speed:

%timeit df.at[tuple(q.values), 'column']
1000 loops, best of 3: 185 µs per loop

Then, when I moved the index to a separate python dictionary:

index = {}
for i in df.index:
  index[i] = np.random.randint(1000)
assert len(index) == len(df.index)

I got a huge speedup:

%timeit index[tuple(q.values)]
100000 loops, best of 3: 2.7 µs per loop

Why is it so? Am I doing something wrong? Is there a way to replicate python dict's speed (or something in <5x range) in a pandas index?

sygi
  • 4,557
  • 2
  • 32
  • 54
  • Interesting behaviour. My best guess is that a dataframe comes with a lot of baggage a dict has not. Also dataframe are not optimized for lookups but for matrix operations. – above_c_level Mar 26 '20 at 18:07
  • That makes sense to be an underlying issue, but still there's a question of how to disable the pandas overhead to get close to the performance of a dict. I am currently just placing a dict on top of pandas index, but surely someone would hit this problem before and merged the solution into pandas? Or is there an inherent reason that makes having a proper dict as an index impossible? – sygi Mar 27 '20 at 09:56

0 Answers0