19

I was confused by this, which is very simple but I didn't immediately find the answer on StackOverflow:

  • df.set_index('xcol') makes the column 'xcol' become the index (when it is a column of df).

  • df.reindex(myList), however, takes indexes from outside the dataframe, for example, from a list named myList that we defined somewhere else.

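However, df.reindex(myList) aligns the existing rows to the new labels, so every label in myList that is not already in df's index gets a row of NaN. If you only want to overwrite the index labels, a simple alternative is: df.index = myList (myList must have the same length as df).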
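
To make the three options concrete, here is a minimal sketch. The frame contents and the name my_list are made up for illustration; only set_index, reindex, and assignment to df.index come from the points above.

```python
import pandas as pd

# Hypothetical data, just for illustration.
df = pd.DataFrame({'xcol': ['x', 'y', 'z'], 'val': [10, 20, 30]})
my_list = ['p', 'q', 'r']  # labels defined outside the dataframe

# set_index: an existing column becomes the index.
by_column = df.set_index('xcol')

# reindex: rows are looked up by label. None of 'p', 'q', 'r' exist in the
# original index (0, 1, 2), so every row comes back as NaN.
reindexed = df.reindex(my_list)

# Direct assignment: labels are overwritten by position, values untouched.
relabeled = df.copy()
relabeled.index = my_list

print(by_column)
print(reindexed)
print(relabeled)
```

Working on a copy in the last step is only there so the original df stays available for comparison.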

I hope this post clarifies it! Additions to this post are also welcome!

Ricardo Guerreiro

3 Answers

24

You can see the difference on a simple example. Let's consider this dataframe:

import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print (df)
   a  b
0  1  3
1  2  4

Indexes are then 0 and 1

If you use set_index with the column 'a' then the indexes are 1 and 2. If you do df.set_index('a').loc[1,'b'], you will get 3.

Now if you use reindex with the same labels 1 and 2, as in df.reindex([1,2]), you will get 4.0 when you do df.reindex([1,2]).loc[1,'b'], because label 1 still refers to the second row of the original df.

What happened is that set_index replaced the previous index labels (0,1) with (1,2) (the values from column 'a') without touching the order of the values in column 'b':

df.set_index('a')
   b
a   
1  3
2  4

while reindex changes the index labels but keeps the values in column 'b' associated with their labels in the original df. Label 2 does not exist in the original index, so it gets NaN:

df.reindex(df.a.values).drop(columns='a') # equivalent to df.reindex([1, 2]).drop(columns='a')
     b
1  4.0
2  NaN
# drop(columns='a') just hides column a, which is not relevant to this example

In summary, reindex selects and orders rows by label, so each row keeps the values associated with its original label (and labels missing from the original index produce NaN rows), while set_index replaces the index labels with the values of a column, without touching the order of the other values in the dataframe.
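
The lookups described in this answer can be checked with a short sketch like the one below. The variable names are only for illustration; the last case (reindex with labels that all already exist) is an extra example, not something shown above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# set_index: labels come from column 'a', rows stay where they were,
# so label 1 is the first row.
via_set_index = df.set_index('a').loc[1, 'b']

# reindex: label 1 already exists in the original index (it is the
# second row), so that row's values come along with it.
via_reindex = df.reindex([1, 2]).loc[1, 'b']

# Label 2 does not exist in the original index, so its row is NaN.
missing = df.reindex([1, 2]).loc[2, 'b']

# When every requested label exists, reindex only reorders the rows.
reordered = df.reindex([1, 0])

print(via_set_index, via_reindex, missing)
print(reordered)
```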

Ben.T
  • 1
    Great explanation! – prosti May 15 '19 at 12:09
  • 1
    Just a brief usage comment, pandas recommends using `at` rather than `loc` for single-cell indexing: `df.at[1, 'b']`. Loc is generally meant for accessing ranges. – ntjess Feb 15 '21 at 18:51
7

Just to add, the (approximate) inverse of set_index is the reset_index method:

import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print (df)

df.set_index('a', inplace=True)
print(df)

df.reset_index(inplace=True, drop=False)
print(df)

   a  b
0  1  3
1  2  4
   b
a   
1  3
2  4
   a  b
0  1  3
1  2  4
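
As for the "more or less": here is a small sketch of two cases where the round trip does not give back exactly the original frame. The column order in this df is chosen on purpose for the illustration and differs from the example above.

```python
import pandas as pd

# 'a' is deliberately the second column here.
df = pd.DataFrame({'b': [3, 4], 'a': [1, 2]})

# reset_index puts the former index back as the FIRST column,
# so the column order becomes a, b instead of b, a.
round_trip = df.set_index('a').reset_index()

# With drop=True the former index is discarded instead of restored.
dropped = df.set_index('a').reset_index(drop=True)

print(round_trip)
print(dropped)
```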
prosti
4

Besides the great answer from Ben.T, I would like to give one more example of how reindex and set_index differ when each is given a shuffled copy of the index:

import pandas as pd
import numpy as np
testdf = pd.DataFrame({'a': [1, 3, 2],'b': [3, 5, 4],'c': [5, 7, 6]})

print(testdf)
print(testdf.set_index(np.random.permutation(testdf.index)))
print(testdf.reindex(np.random.permutation(testdf.index)))

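Output (the permutations are random, so yours may differ):

  • With set_index, the shuffled labels are assigned by position: the index column (the first column) changes, while the order of the rows is kept intact.
  • With reindex, the rows are reordered to follow the shuffled labels, and each row keeps its original label.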
   a  b  c
0  1  3  5
1  3  5  7
2  2  4  6
   a  b  c
1  1  3  5
2  3  5  7
0  2  4  6
   a  b  c
2  2  4  6
1  3  5  7
0  1  3  5
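
Since the random permutation makes the output above hard to reproduce, here is a sketch of the same comparison with a fixed permutation. The particular order [2, 0, 1] is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({'a': [1, 3, 2], 'b': [3, 5, 4], 'c': [5, 7, 6]})

# Fixed "shuffle" of the index labels. It is passed as an array because
# set_index would treat a plain list as a list of column names.
perm = np.array([2, 0, 1])

# set_index: new labels are assigned by position, rows do not move.
relabeled = testdf.set_index(perm)

# reindex: rows are fetched by label, so they move to follow perm.
reindexed = testdf.reindex(perm)

print(relabeled)
print(reindexed)
```

In relabeled the first row (a=1) now carries label 2, while in reindexed label 2 still points to the row that originally had it (a=2).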
Long