Another, minor, question: is it bad practice to have a non-unique
indexes?
It is not great practice, but depends on your needs and can be okay in some circumstances.
Issue 1: join operations
A good place to start is to think about what makes an Index different from a standard DataFrame column. This engenders the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex
-ed DataFrame? If you've ever used SQL or any other DMBS and want to mimic join operations in pandas with functions such as .join
or .merge
, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a cartesian product--probably not what you're looking for.
For example:
df = pd.DataFrame(np.random.randn(10,2),
index=2*list('abcde'))
df2 = df.rename(columns={0: 'a', 1 : 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
0 1 a b
a 0.73737 1.49073 0.73737 1.49073
a 0.73737 1.49073 -0.25562 -2.79859
a -0.25562 -2.79859 0.73737 1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583 1.17583 -0.93583 1.17583
b -0.93583 1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583 1.17583
Issue 2: performance
Unique-valued indices make certain operations efficient, as explained in this post.
When index is unique, pandas use a hashtable to map key to value O(1).
When index is non-unique and sorted, pandas use binary search O(logN),
when index is random ordered pandas need to check all the keys in the
index O(N).
A word on .loc
Using .loc
will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,
df = pd.DataFrame(np.random.randn(10,2),
index=2*list('abcde'))
print(df.loc['a'])
0 1
a 0.73737 1.49073
a -0.25562 -2.79859