I have a very large dataframe, around 80GB. I want to change the type of some of its columns from object to category. Trying to do it this way:

df[col_name] = df[col_name].astype('category') 

Takes around 1 minute per column, which is a lot. My first question would be why does it take that long? Just running:

df[col_name].astype('category') 

takes just around 1 second. I tried something like:

temp = df[col_name].astype('category')
df.drop(columns=[col_name])
df[col_name] = temp

In this case it turns out that dropping the column is also very slow. I also tried replacing `drop` with `del`, that is:

temp = df[col_name].astype('category')
del df[col_name]
df[col_name] = temp

Surprisingly (to me), this was very fast. So my second question is: why is `del` so much faster than `drop` in this case? What is the most "correct" way of doing this conversion, and what is the most efficient (hopefully they are the same)? Thanks

user25640
  • `del` maps this operation to `df.__delitem__('column name')`, which is an internal method of DataFrame. `df.pop(col_name)` is also faster than `drop`. However, `del` is not the recommended way to delete a column, according to the answer to this question: [Delete a column from a Pandas DataFrame](https://stackoverflow.com/questions/13411544/delete-a-column-from-a-pandas-dataframe) – Lazyer Jun 30 '22 at 02:24
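
To make the difference between the three calls concrete, here is a minimal sketch on a made-up toy frame (the column names `city`, `n` and `flag` are placeholders, not from the question). It only shows which calls modify `df` in place and which return a new object; it does not measure timing.

```python
import pandas as pd

# Toy data; column names are hypothetical and only for illustration.
df = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Oslo"], "n": [1, 2, 3], "flag": [True, False, True]}
)

# drop() without inplace=True returns a new DataFrame; df itself is unchanged,
# so the result has to be assigned somewhere to have any effect.
dropped = df.drop(columns=["city"])
city_still_in_df = "city" in df.columns

# del removes the column from df in place (it calls df.__delitem__).
del df["flag"]

# pop() also removes the column in place, and returns it as a Series.
popped = df.pop("city")

print(city_still_in_df)
print(list(dropped.columns))
print(list(df.columns))
```

Note that a bare `df.drop(columns=[col_name])`, as in the question, builds a new frame and then discards it, leaving `df` as it was.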

1 Answer

You could use something like

pd.Categorical(df['col_name'].values)
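
For reference, a small sketch of the convert, delete, reassign pattern from the question, run on made-up toy data (the frame contents and sizes are assumptions for illustration). It checks the resulting dtype and compares memory use before and after; it says nothing about how long the steps take on an 80GB frame.

```python
import pandas as pd

# Toy frame standing in for the real data; values are made up.
df = pd.DataFrame({"col_name": ["a", "b", "a", "c"] * 1000, "x": range(4000)})
col_name = "col_name"

bytes_before = df[col_name].memory_usage(deep=True)

# Build the categorical column first, then swap it in.
converted = df[col_name].astype("category")
del df[col_name]          # in-place removal of the old column
df[col_name] = converted  # re-added column is appended at the end

bytes_after = df[col_name].memory_usage(deep=True)

print(df[col_name].dtype)
print(bytes_before, bytes_after)
```

One side effect to be aware of: because the column is deleted and then re-added, it ends up as the last column of `df` rather than in its original position.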
ArchAngelPwn