Why is merging dataframes in Pandas on an index more efficient (faster) than on a column?
import pandas as pd
# Dataframes share the ID column
df = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                   'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})
df2 = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                    'Level': [12, 15, 14, 20, 21, 11, 15],
                    'Age': [33, 41, 42, 50, 45, 28, 32]})
# Move ID into the index on both frames so they can be merged on the index instead
df = df.set_index('ID')
df2 = df2.set_index('ID')
In my timings, merging on the index was about 3.5 times faster than merging on the ID column (using Pandas 0.23.0).
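For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two merges might be timed. The `how='outer'` argument, the repetition count `n`, and the names `left`, `right`, `merge_on_column` and `merge_on_index` are assumptions for illustration, not the exact calls behind the figure above, and the ratio will vary with the Pandas version and machine.

```python
import timeit

import pandas as pd

# Same data as above, kept under separate names so both layouts coexist.
left = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                     'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})
right = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                      'Level': [12, 15, 14, 20, 21, 11, 15],
                      'Age': [33, 41, 42, 50, 45, 28, 32]})

# Indexed copies: ID becomes the row index instead of a column.
left_idx = left.set_index('ID')
right_idx = right.set_index('ID')


def merge_on_column():
    # Join key is an ordinary column (assumed outer join).
    return left.merge(right, on='ID', how='outer')


def merge_on_index():
    # Join key is the index on both sides (assumed outer join).
    return left_idx.merge(right_idx, left_index=True, right_index=True, how='outer')


n = 200  # assumed repetition count
t_col = timeit.timeit(merge_on_column, number=n)
t_idx = timeit.timeit(merge_on_index, number=n)

print(f"column merge: {t_col / n * 1e3:.3f} ms per call")
print(f"index merge:  {t_idx / n * 1e3:.3f} ms per call")
```

With frames this small, fixed per-call overhead probably dominates, so the gap may look different on larger data.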
The Pandas internals page says that an Index "Populates a dict of label to location in Cython to do O(1) lookups." Does this mean that operations on an index are more efficient than the same operations on columns? Is it best practice to always use the index for operations such as merges?
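To make the quoted sentence concrete, here is a small sketch of a label-to-location lookup on an index next to the equivalent lookup on a column. It only illustrates single-label access. Whether `merge` takes the same path internally is exactly what I am unsure about, and the variable names are mine.

```python
import pandas as pd

df2 = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                    'Level': [12, 15, 14, 20, 21, 11, 15],
                    'Age': [33, 41, 42, 50, 45, 28, 32]})
indexed = df2.set_index('ID')

# Index lookup: the label 5 is mapped to its integer position.
pos = indexed.index.get_loc(5)
row_via_index = indexed.loc[5]

# Column lookup: builds a boolean mask by comparing every value in the column.
row_via_column = df2[df2['ID'] == 5]

print(pos)
print(row_via_index['Level'], row_via_column['Level'].iloc[0])
```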
I read through the documentation for joining and merging, and it doesn't explicitly mention any benefit of using the index.