Pandas allows both unique and non-unique indices, but some operations only allow unique indices. In what situations would it make sense to use a non-unique index? I would think enforcing uniqueness of the index helps surface data integrity problems upfront.
2 Answers
Disclaimer: a unique RangeIndex is always going to be the most performant option. This question already favours using a unique index and is specifically asking for cases where allowing non-unique indexes is desirable. For that reason, unique indexes and performance are not discussed from this point forward, only the useful benefits of non-unique indexes.
In general, non-unique indexes are preferable whenever we need to keep track of where data originally came from. There are many cases where, in intermediary phases, we need to know which row a value was at. This lets us do computations with respect to information that would either be lost if the index were unique, or would require an additional column to track. Below are just a few examples:
Interleaving multiple DataFrames:
Consider the following 2 DataFrames, let's assume that each DataFrame represents a day's worth of data. We would like to review this daily data by Sample number rather than by day:
import pandas as pd

df1 = pd.DataFrame([['10:05', 'Day 1', 'Sample 1'],
                    ['11:14', 'Day 1', 'Sample 2']])
df2 = pd.DataFrame([['10:03', 'Day 2', 'Sample 1'],
                    ['11:12', 'Day 1', 'Sample 2']])
# df1
       0      1         2
0  10:05  Day 1  Sample 1
1  11:14  Day 1  Sample 2

# df2
       0      1         2
0  10:03  Day 2  Sample 1
1  11:12  Day 1  Sample 2
Because pandas allows non-unique indexes, we can concat and then sort_index:
pd.concat([df1, df2]).sort_index()
       0      1         2
0  10:05  Day 1  Sample 1
0  10:03  Day 2  Sample 1
1  11:14  Day 1  Sample 2
1  11:12  Day 1  Sample 2
Notice this is the fastest way to interleave two DataFrames by row index. Also notice that it would not be feasible to instead sort by columns 1 and 2, as words like Day 1 and Sample 1 will be sorted lexicographically, which runs into issues for values like Day 10, or would require a bunch of additional computation to handle the numeric parts correctly.
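A minimal sketch of the lexicographic pitfall mentioned above (the data here is hypothetical): as plain strings, Day 10 sorts before Day 2.

```python
import pandas as pd

# String sorting compares character by character, so '1' < '2' puts
# 'Day 10' ahead of 'Day 2'.
s = pd.Series(['Day 1', 'Day 10', 'Day 2'])
print(s.sort_values().tolist())  # ['Day 1', 'Day 10', 'Day 2']
```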
We can add ignore_index=True to sort_index, but this only hides the overwriting with a new RangeIndex, and it still relies on the fact that concat returns a DataFrame with non-unique indexes.
pd.concat([df1, df2]).sort_index(ignore_index=True)
       0      1         2
0  10:05  Day 1  Sample 1
1  10:03  Day 2  Sample 1
2  11:14  Day 1  Sample 2
3  11:12  Day 1  Sample 2
Explode and Reduce
explode, particularly on Series, is a common operation, and not losing the index (i.e. allowing duplicates) makes expand-and-reduce type operations much easier.
The goal is to remove any duplicate values from within a comma separated string within a column:
df = pd.DataFrame({
    'corresponding_id': [10, 20, 30],
    'col': ['a,b,c,a', 'b,c,c,b', 'a,a,a,a']
})
df:
   corresponding_id      col
0                10  a,b,c,a
1                20  b,c,c,b
2                30  a,a,a,a
A common solution may look something like:
import numpy as np

df['col'] = (
    df['col'].str.split(',').explode()
    .groupby(level=0).apply(lambda s: ','.join(np.unique(s)))
)
df:
   corresponding_id    col
0                10  a,b,c
1                20    b,c
2                30      a
After exploding the result looks like:
df['col'].str.split(',').explode()
0 a
0 b
0 c
0 a
1 b
1 c
1 c
1 b
2 a
2 a
2 a
2 a
Name: col, dtype: object
Because there are duplicate indexes, we can groupby relative to level=0 (the index). This is only possible because the index was preserved. If the index did not allow duplicates we would instead have:
0 a
1 b
2 c
3 a
4 b
5 c
6 c
7 b
8 a
9 a
10 a
11 a
Name: col, dtype: object
There would be no easy way to determine which rows the values came from, making it even more difficult to put them back in place.
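For comparison, a sketch of what the same reduction requires once the duplicate index is thrown away: the original row number has to be carried in an explicit column (named orig_row here purely for illustration) before groupby can put the values back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'corresponding_id': [10, 20, 30],
    'col': ['a,b,c,a', 'b,c,c,b', 'a,a,a,a']
})

# Materialize the duplicate index as a column, leaving a unique index
# behind; the grouping key now lives in the data instead of the index.
exploded = (
    df['col'].str.split(',').explode()
    .rename_axis('orig_row').reset_index()
)
df['col'] = (
    exploded.groupby('orig_row')['col']
    .apply(lambda s: ','.join(np.unique(s)))
)
print(df)  # col becomes 'a,b,c', 'b,c', 'a'
```

This works, but only because we preserved the index as a column before resetting it, which is exactly the bookkeeping a non-unique index gives us for free.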
Scaling Up a DataFrame
The ability to select from a DataFrame using duplicate labels is extremely helpful in scaling up a DataFrame.
df = pd.DataFrame({
    'Count': [2, 4],
    'Value': [1, 6]
})
On occasion we need to scale up a DataFrame; in these cases we use loc to select from it:
df.loc[[0, 0, 1, 1, 1, 1], :]
Notice the result is:
   Count  Value
0      2      1
0      2      1
1      4      6
1      4      6
1      4      6
1      4      6
We were able to select the same row multiple times from the DataFrame based on duplicate labels (and the resulting index is non-unique). This is so common that there is a method, Index.repeat, that builds such a selection dynamically based on a column:
df.loc[df.index.repeat(df['Count']), :]
   Count  Value
0      2      1
0      2      1
1      4      6
1      4      6
1      4      6
1      4      6
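Once the scaled-up rows are no longer needed for provenance, the duplicate labels can be dropped. A minimal sketch combining the repeat with reset_index:

```python
import pandas as pd

df = pd.DataFrame({'Count': [2, 4], 'Value': [1, 6]})

# Scale up via the duplicate labels, then restore a unique RangeIndex
# once the row provenance is no longer needed.
scaled = df.loc[df.index.repeat(df['Count'])].reset_index(drop=True)
print(scaled['Value'].tolist())  # [1, 1, 6, 6, 6, 6]
```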

There are various operations where you may (at least temporarily) want to retain non-unique index values so you can tell where the corresponding rows came from, or which ones belong together. Such operations include
Often, it makes sense at some later point to reindex or reset the index, since performance-wise, non-unique indices may have a huge impact, especially when unsorted.
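A minimal sketch of that workflow: pandas exposes whether labels are duplicated via Index.is_unique (and sortedness via Index.is_monotonic_increasing), so you can keep the non-unique index only as long as it is needed.

```python
import pandas as pd

# concat preserves each frame's labels, producing duplicates.
df = pd.concat([
    pd.DataFrame({'x': [1, 2]}),
    pd.DataFrame({'x': [3, 4]}),
])
assert not df.index.is_unique

# Once row provenance is no longer needed, restore a unique RangeIndex.
df = df.reset_index(drop=True)
assert df.index.is_unique
```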
