Using the DataFrame.set_index() method

Question

Good morning,

I have some error and time data in two columns:

edf = pd.DataFrame({'error':error, 'time':time})

Which gives:

            error    time
0     0.000000e+00 -10.000
1     4.219215e-28  -9.995
2     8.870728e-28  -9.990
3     1.398745e-27  -9.985
4     1.960445e-27  -9.980
5     2.575915e-27  -9.975
6     3.249142e-27  -9.970
7     3.984379e-27  -9.965
8     4.786157e-27  -9.960
9     5.659303e-27  -9.955
10    6.608959e-27  -9.950

According to the documentation, I can use edf.set_index('time', drop=True) to set the time column as my index and drop it from its previous place in the data frame (I believe it drops by default). However, this does absolutely nothing. In fact, I was so confused that I decided to copy and paste the code example straight from the documentation, and it does not work either.

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

Which gives,

   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31

After which, df.set_index('month') also gives:

   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31

Instead of what documentation advertises:

       year  sale
month
1      2012    55
4      2014    40
7      2013    84
10     2014    31

What gives?

score 1 · Accepted Answer · answered Oct 15 '19 at 18:32

set_index returns a new dataframe by default and leaves the original unchanged. So use:

# recommended
edf.set_index('time', drop=True, inplace=True)

or

edf = edf.set_index('time', drop=True)
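As a minimal sketch of why the call in the question appeared to do nothing, the snippet below reuses the documentation's example frame; the variable name `result` is only illustrative. It shows that the original frame is untouched while the returned frame carries the new index:

```python
import pandas as pd

# Same toy frame as the documentation example quoted in the question.
df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# set_index returns a new DataFrame; df itself is not modified.
result = df.set_index('month')

print(df.columns.tolist())      # 'month' is still an ordinary column here
print(result.index.name)        # the returned frame is indexed by 'month'
print(result.columns.tolist())  # and 'month' is no longer among its columns
```

In an interactive session the returned frame is displayed but then discarded unless it is assigned to a name, which is why `df` looks unchanged afterwards.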

Quang Hoang

Personally, I always prefer to be explicit and never use `inplace` (i.e. I would always use the second method). In fact, `inplace` is expected to be deprecated. https://github.com/pandas-dev/pandas/issues/16529 – Alexander Oct 15 '19 at 18:35
The linked GitHub issue is still under discussion and not final. In my experience, `inplace=True` does sometimes save a lot of memory. – Quang Hoang Oct 15 '19 at 18:42
It is certainly debatable, which is why I highlighted the difference of opinion. I don't believe there would be any difference in memory usage. Do you have any references where I could learn more about that? Here is some more SO discussion on inplace: https://stackoverflow.com/questions/45570984/pandas-is-inplace-true-considered-harmful-or-not – Alexander Oct 15 '19 at 18:48
@Alexander I don't have any reference. That came purely from my experience. However, from your links, [inplace is good](https://stackoverflow.com/questions/34320137/guidelines-on-using-pandas-inplace-keyword-argument/34326313#34326313) does show the difference in memory usage. – Quang Hoang Oct 15 '19 at 18:52
And the [in-place is bad!](https://stackoverflow.com/a/22533110/8425408) link shows that the example may not hold in practice. I think we can agree that it is a point of disagreement as to best practice. – Alexander Oct 15 '19 at 19:20
The bad link only shows run time stats, not memory usage! – Quang Hoang Oct 15 '19 at 19:26
"Often they are actually the same operation that works on a copy" implies higher memory usage. – Alexander Oct 15 '19 at 21:02

score 1 · Answer 2 · answered Oct 15 '19 at 18:34

Most dataframe operations don't modify the original dataframe by default. Instead, they return a new dataframe as a result.

You could assign that result to a new variable, or to the same one:

df = df.set_index('month')

Or you could pass a parameter to the function to tell it to modify the original dataframe in place:

df.set_index('month', inplace=True)

This tripped me up a lot when I started working with Pandas.
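One related pitfall, sketched here under the assumption that your pandas version behaves like the ones I have used: with `inplace=True` the method modifies the frame and returns `None`, so the two styles should not be combined. The name `returned` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# With inplace=True the frame is changed directly and nothing is returned.
returned = df.set_index('month', inplace=True)

print(returned)       # no DataFrame comes back from the in-place call
print(df.index.name)  # df itself now uses 'month' as its index
```

So writing `df = df.set_index('month', inplace=True)` would rebind `df` to `None`; pick either reassignment or `inplace=True`, not both.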
