I have a DataFrame df:

import pandas as pd

df = pd.DataFrame({'a':[7,8,9],
                   'b':[1,3,5],
                   'c':[5,3,6]})

print (df)
   a  b  c
0  7  1  5
1  8  3  3
2  9  5  6

Then I rename the first column label like this:

df.columns.values[0] = 'f'

Everything seems fine:

print (df)
   f  b  c
0  7  1  5
1  8  3  3
2  9  5  6

print (df.columns)
Index(['f', 'b', 'c'], dtype='object')

print (df.columns.values)
['f' 'b' 'c']

Selecting b works as expected:

print (df['b'])
0    1
1    3
2    5
Name: b, dtype: int64

But selecting a returns the column now named f:

print (df['a'])
0    7
1    8
2    9
Name: f, dtype: int64

And selecting f raises a KeyError:

print (df['f'])
#KeyError: 'f'

print (df.info())
#KeyError: 'f'

What is the problem? Can somebody explain it? Or is it a bug?

jezrael
  • This behaviour is mentioned in the comments of this [answer](http://stackoverflow.com/a/11346337/3423035). Since one is modifying the internal state of this index object, the change might not get propagated to all instances using it. I think using `df.rename(columns={'a': 'f'})` is the intended way to go. – Jan Trienes Apr 08 '17 at 09:11

1 Answer

You aren't expected to alter the values attribute.

Try df.columns.values = ['a', 'b', 'c'] and you get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-61-e7e440adc404> in <module>()
----> 1 df.columns.values = ['a', 'b', 'c']

AttributeError: can't set attribute

That's because pandas detects that you are trying to set the attribute and stops you.

However, it can't stop you from changing the underlying values object itself.
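The difference between rebinding an attribute and mutating the object it returns can be sketched without pandas at all. The `ToyIndex` class below is a hypothetical stand-in written for illustration, not pandas code; it only assumes that `values` is exposed as a read-only property returning a mutable object.

```python
class ToyIndex:
    """Toy stand-in for an index object; not pandas code."""

    def __init__(self, labels):
        self._labels = list(labels)

    @property
    def values(self):
        # Read-only property: there is no setter, so assignment fails,
        # but the returned list is the same mutable object every time.
        return self._labels


idx = ToyIndex(['a', 'b', 'c'])

try:
    idx.values = ['x', 'y', 'z']   # rebinding the attribute is blocked
except AttributeError as exc:
    print('blocked:', type(exc).__name__)

idx.values[0] = 'f'                # in-place mutation goes through
print(idx.values)
```

The assignment is rejected, yet the element-wise write succeeds because it never goes through the property's (missing) setter.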

When you use rename, pandas follows up with some cleanup, such as clearing the item cache. I've pasted the source below.

Ultimately, what you've done is alter the values without triggering that cleanup. You can trigger it yourself with a follow-up call to _data.rename_axis (an example can be seen in the source below). This forces the cleanup to run, and then you can access ['f']:

df._data = df._data.rename_axis(lambda x: x, 0, True)
df['f']

0    7
1    8
2    9
Name: f, dtype: int64

Moral of the story: probably not a great idea to rename a column this way.
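For comparison, here is a minimal sketch of two supported ways to rename the column, using the same example frame. The variable name `renamed` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 8, 9],
                   'b': [1, 3, 5],
                   'c': [5, 3, 6]})

# rename returns a new DataFrame by default; the original is left untouched.
renamed = df.rename(columns={'a': 'f'})
print(renamed['f'])

# Assigning a whole new list of labels replaces the Index object instead of
# mutating its underlying array in place.
df.columns = ['f', 'b', 'c']
print(df['f'])
```

Both routes go through the public API, so lookups by the new label work afterwards.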


but this story gets weirder

This is fine

df = pd.DataFrame({'a':[7,8,9],
                   'b':[1,3,5],
                   'c':[5,3,6]})

df.columns.values[0] = 'f'

df['f']

0    7
1    8
2    9
Name: f, dtype: int64

This is not fine

df = pd.DataFrame({'a':[7,8,9],
                   'b':[1,3,5],
                   'c':[5,3,6]})

print(df)

df.columns.values[0] = 'f'

df['f']
KeyError:

It turns out that if we modify the values attribute before df is first displayed, the lookup works, apparently because pandas runs its initialization on the first display. If df is displayed before the values attribute is changed, the lookup raises a KeyError.
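One way to picture this theory is a toy model in which the label lookup table is built lazily and then reused. This is only a sketch under that assumption: `ToyIndex` is hypothetical, its `get_loc` method merely borrows a familiar name, and none of it is taken from the pandas source.

```python
class ToyIndex:
    """Toy model of an index with a lazily built lookup table; not pandas code."""

    def __init__(self, labels):
        self.values = list(labels)
        self._lookup = None          # built on the first label lookup

    def get_loc(self, label):
        if self._lookup is None:
            # The table is built once from the current values and then reused.
            self._lookup = {lab: pos for pos, lab in enumerate(self.values)}
        return self._lookup[label]   # raises KeyError for unknown labels


# Case 1: mutate before any lookup, so the table is built from the new labels.
early = ToyIndex(['a', 'b', 'c'])
early.values[0] = 'f'
print(early.get_loc('f'))

# Case 2: a lookup happens first, then the mutation leaves the table stale.
late = ToyIndex(['a', 'b', 'c'])
late.get_loc('b')                    # stands in for whatever the first display triggers
late.values[0] = 'f'
print(late.get_loc('a'))             # the old label still resolves
try:
    late.get_loc('f')
except KeyError:
    print("KeyError: 'f'")
```

In the second case the displayed labels and the lookup table disagree, which matches the symptoms in the question: 'a' still resolves while 'f' does not.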

weirder still

df = pd.DataFrame({'a':[7,8,9],
                   'b':[1,3,5],
                   'c':[5,3,6]})

print(df)

df.columns.values[0] = 'f'

df['f'] = 1

df['f']

   f  f
0  7  1
1  8  1
2  9  1

As if we didn't already know that this was a bad idea...


source for rename

def rename(self, *args, **kwargs):

    axes, kwargs = self._construct_axes_from_arguments(args, kwargs)
    copy = kwargs.pop('copy', True)
    inplace = kwargs.pop('inplace', False)

    if kwargs:
        raise TypeError('rename() got an unexpected keyword '
                        'argument "{0}"'.format(list(kwargs.keys())[0]))

    if com._count_not_none(*axes.values()) == 0:
        raise TypeError('must pass an index to rename')

    # renamer function if passed a dict
    def _get_rename_function(mapper):
        if isinstance(mapper, (dict, ABCSeries)):

            def f(x):
                if x in mapper:
                    return mapper[x]
                else:
                    return x
        else:
            f = mapper

        return f

    self._consolidate_inplace()
    result = self if inplace else self.copy(deep=copy)

    # start in the axis order to eliminate too many copies
    for axis in lrange(self._AXIS_LEN):
        v = axes.get(self._AXIS_NAMES[axis])
        if v is None:
            continue
        f = _get_rename_function(v)

        baxis = self._get_block_manager_axis(axis)
        result._data = result._data.rename_axis(f, axis=baxis, copy=copy)
        result._clear_item_cache()

    if inplace:
        self._update_inplace(result._data)
    else:
        return result.__finalize__(self)
piRSquared
  • Very interesting research! – MaxU - stand with Ukraine Apr 08 '17 at 11:36
  • I am wondering how `print` can cause this difference. Do you have any idea why? I have never seen it before. – jezrael Apr 09 '17 at 06:07
  • @jezrael my theory is that there is initialization that happens upon the first print. – piRSquared Apr 09 '17 at 15:45
  • But is it a bug, because of the influence of `print`? I think it is impossible, but maybe I am wrong. – jezrael Apr 09 '17 at 15:47
  • @jezrael when print is called, it calls the __repr__ method. At that point I'm guessing pandas runs some caching scripts if they haven't run before. – piRSquared Apr 09 '17 at 16:01
  • Hmmm, I checked [this](http://stackoverflow.com/a/42978728/2901002) and tried `print (df.__dict__)` before and after `rename`. After `print(df)` it has changed: some `_iloc` is added. But why? It is really weird. – jezrael Apr 09 '17 at 16:02
  • @jezrael agreed. Definitely weird – piRSquared Apr 09 '17 at 17:13
  • Thank you for digging into this. I'm still confused on the best approach to rename columns. I'm using the 'df._data = df._data.rename_axis(lambda x: x, 0, True)' workaround for now. – adivis12 Jan 05 '18 at 00:07
  • @adivis12 this issue is unrelated to how you should rename columns or an index. See https://stackoverflow.com/a/46192213/2336654 – piRSquared Jan 05 '18 at 00:35