Pandas allows both unique and non-unique indices, but some operations only allow unique indices. In what situations would it make sense to use a non-unique index? I would think enforcing uniqueness of the index helps surface data integrity problems upfront.
2 Answers
Disclaimer: a unique RangeIndex is always going to be the most performant option. This question already favours using a unique index and is specifically asking for cases where allowing non-unique indexes is desirable. For that reason, unique indexes and performance are not discussed from this point forward, only the useful benefits of non-unique indexes.
In general, non-unique indexes are preferable whenever we need to keep track of where data originally came from. There are many cases where, in intermediary phases, we need to know which row a value was at. This lets us do computations with respect to information that would either be lost if the index were unique, or would require an additional column to track. Below are just a few examples:
Interleaving multiple DataFrames:
Consider the following 2 DataFrames, let's assume that each DataFrame represents a day's worth of data. We would like to review this daily data by Sample number rather than by day:
import pandas as pd

df1 = pd.DataFrame([['10:05', 'Day 1', 'Sample 1'],
                    ['11:14', 'Day 1', 'Sample 2']])
df2 = pd.DataFrame([['10:03', 'Day 2', 'Sample 1'],
                    ['11:12', 'Day 1', 'Sample 2']])
# df1
       0      1         2
0  10:05  Day 1  Sample 1
1  11:14  Day 1  Sample 2

# df2
       0      1         2
0  10:03  Day 2  Sample 1
1  11:12  Day 1  Sample 2
Because pandas allows non-unique indexes, we can concat and then sort_index:
pd.concat([df1, df2]).sort_index()
       0      1         2
0  10:05  Day 1  Sample 1
0  10:03  Day 2  Sample 1
1  11:14  Day 1  Sample 2
1  11:12  Day 1  Sample 2
Notice this is the fastest way to interleave two DataFrames by row index. Also notice that it would not be feasible to instead sort by columns 1 and 2, as words like Day 1 and Sample 1 will be sorted lexicographically, which runs into issues for values like Day 10, or would require a bunch of additional computation to handle the numeric parts correctly.
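A minimal sketch of the lexicographic pitfall mentioned above (the data here is hypothetical): as plain strings, Day 10 sorts before Day 2.

```python
import pandas as pd

# String sorting compares character by character, so '1' < '2' puts
# 'Day 10' ahead of 'Day 2'.
s = pd.Series(['Day 1', 'Day 10', 'Day 2'])
print(s.sort_values().tolist())  # ['Day 1', 'Day 10', 'Day 2']
```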
We can add ignore_index=True to sort_index, but this only hides the overwriting with a new RangeIndex, and it still relies on the fact that concat returns a DataFrame with non-unique indexes.
pd.concat([df1, df2]).sort_index(ignore_index=True)
       0      1         2
0  10:05  Day 1  Sample 1
1  10:03  Day 2  Sample 1
2  11:14  Day 1  Sample 2
3  11:12  Day 1  Sample 2
Explode and Reduce
explode, particularly on Series, is a common operation, and not losing the index (i.e. allowing duplicates) makes expand-and-reduce type operations much easier.
The goal is to remove any duplicate values from within a comma separated string within a column:
df = pd.DataFrame({
    'corresponding_id': [10, 20, 30],
    'col': ['a,b,c,a', 'b,c,c,b', 'a,a,a,a']
})
df:
   corresponding_id      col
0                10  a,b,c,a
1                20  b,c,c,b
2                30  a,a,a,a
A common solution may look something like:
import numpy as np

df['col'] = (
    df['col'].str.split(',').explode()
    .groupby(level=0).apply(lambda s: ','.join(np.unique(s)))
)
df:
   corresponding_id    col
0                10  a,b,c
1                20    b,c
2                30      a
After exploding the result looks like:
df['col'].str.split(',').explode()
0 a
0 b
0 c
0 a
1 b
1 c
1 c
1 b
2 a
2 a
2 a
2 a
Name: col, dtype: object
Because there are duplicate indexes, we can groupby relative to level=0 (the index). This is only possible because the index was preserved. If the index did not allow duplicates we would instead have:
0 a
1 b
2 c
3 a
4 b
5 c
6 c
7 b
8 a
9 a
10 a
11 a
Name: col, dtype: object
There would be no easy way to determine which rows the values came from, making it even more difficult to put them back in place.
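For comparison, a sketch of what the same reduction requires once the duplicate index is thrown away: the original row number has to be carried in an explicit column (named orig_row here purely for illustration) before groupby can put the values back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'corresponding_id': [10, 20, 30],
    'col': ['a,b,c,a', 'b,c,c,b', 'a,a,a,a']
})

# Materialize the duplicate index as a column, leaving a unique index
# behind; the grouping key now lives in the data instead of the index.
exploded = (
    df['col'].str.split(',').explode()
    .rename_axis('orig_row').reset_index()
)
df['col'] = (
    exploded.groupby('orig_row')['col']
    .apply(lambda s: ','.join(np.unique(s)))
)
print(df)  # col becomes 'a,b,c', 'b,c', 'a'
```

This works, but only because we preserved the index as a column before resetting it, which is exactly the bookkeeping a non-unique index gives us for free.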
Scaling Up a DataFrame
The ability to select from a DataFrame using duplicate labels is extremely helpful in scaling up a DataFrame.
df = pd.DataFrame({
    'Count': [2, 4],
    'Value': [1, 6]
})
On occasion we need to scale up a DataFrame; in these cases we use loc to select from it:
df.loc[[0, 0, 1, 1, 1, 1], :]
Notice the result is:
   Count  Value
0      2      1
0      2      1
1      4      6
1      4      6
1      4      6
1      4      6
We were able to select the same row multiple times from the DataFrame based on duplicate labels (and the resulting index is non-unique). This is so common that there is a method, Index.repeat, that builds such a selection dynamically based on a column:
df.loc[df.index.repeat(df['Count']), :]
   Count  Value
0      2      1
0      2      1
1      4      6
1      4      6
1      4      6
1      4      6
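Once the scaled-up rows are no longer needed for provenance, the duplicate labels can be dropped. A minimal sketch combining the repeat with reset_index:

```python
import pandas as pd

df = pd.DataFrame({'Count': [2, 4], 'Value': [1, 6]})

# Scale up via the duplicate labels, then restore a unique RangeIndex
# once the row provenance is no longer needed.
scaled = df.loc[df.index.repeat(df['Count'])].reset_index(drop=True)
print(scaled['Value'].tolist())  # [1, 1, 6, 6, 6, 6]
```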

There are various operations where you may (at least temporarily) want to retain non-unique index values so you can tell where the corresponding rows came from, or which ones belong together. Such operations include
Often, it makes sense at some later point to reindex or reset the index, since performance-wise, non-unique indices may have a huge impact, especially when unsorted.
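A minimal sketch of that workflow: pandas exposes whether labels are duplicated via Index.is_unique (and sortedness via Index.is_monotonic_increasing), so you can keep the non-unique index only as long as it is needed.

```python
import pandas as pd

# concat preserves each frame's labels, producing duplicates.
df = pd.concat([
    pd.DataFrame({'x': [1, 2]}),
    pd.DataFrame({'x': [3, 4]}),
])
assert not df.index.is_unique

# Once row provenance is no longer needed, restore a unique RangeIndex.
df = df.reset_index(drop=True)
assert df.index.is_unique
```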
