
Given a large data frame (in my case 250M rows and 30 cols), why is it so slow to just change the name of a column?

I am using `df.rename(columns={'oldName':'newName'}, inplace=True)`, so this should not make any copies of the data, yet it takes over 30 seconds. I would have expected it to take on the order of milliseconds, as it's just replacing one string with another.

I know that's a huge table, more than most people have RAM for in their machine (hence I'm not going to add example code either), but this still shouldn't take any significant amount of time, as it's not actually touching any of the data. Why does this take so long, i.e. why does renaming a column take effort proportional to the number of rows in my dataframe?

Peter
  • Try renaming your columns by not using `inplace`: `df = df.rename(columns={'oldName':'newName'})` – It_is_Chris Oct 08 '20 at 14:09
  • Also performance [in this question](https://stackoverflow.com/questions/22532302/pandas-peculiar-performance-drop-for-inplace-rename-after-dropna/22533110#22533110) shows rename with `inplace` is actually slower than without. – Quang Hoang Oct 08 '20 at 14:15
  • @Quang Hoang: It does not show that. Once the uncertainty of `.dropna` is eliminated, renaming with `inplace` is faster. My own tests confirm that, especially for large dataframes. I used Python 3.9.6 and pandas 1.3.1 – Martin R Sep 10 '22 at 02:57

1 Answer


I don't think `inplace=True` avoids copying your data. There is some discussion on SO saying that it actually does copy and then assigns the result back. Also see this GitHub issue.

You can just override the columns with:

df.columns = df.columns.to_series().replace({'a':'b'})
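
For illustration, here is a minimal sketch of the same idea using the column names from the question. The helper name `rename_by_assignment` and the small test frame are made up for this example, and the timings it prints will depend on your pandas version and machine, so treat it as a way to measure rather than as a result.

```python
import timeit

import pandas as pd


def rename_by_assignment(df, mapping):
    """Rename columns by replacing the column labels only.

    Hypothetical helper: it builds a new list of labels and assigns it
    to df.columns, so the row data itself is never iterated over.
    """
    df.columns = [mapping.get(name, name) for name in df.columns]
    return df


# Small stand-in frame; the real table in the question is far larger.
df = pd.DataFrame({"oldName": range(100_000), "other": 0.0})

# Time rename() without inplace (returns a new frame, df is unchanged).
t_rename = timeit.timeit(
    lambda: df.rename(columns={"oldName": "newName"}), number=10
)

# Time direct assignment of the column labels (modifies df).
t_assign = timeit.timeit(
    lambda: rename_by_assignment(df, {"oldName": "newName"}), number=10
)

print(f"rename():   {t_rename:.6f} s for 10 calls")
print(f"assignment: {t_assign:.6f} s for 10 calls")
print(list(df.columns))
```

Running the same comparison on frames of increasing size should show whether the cost of either approach grows with the number of rows in your environment.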
Quang Hoang