0

In the codebase I'm working in, I see the following

df1 = some pandas dataframe (one of the columns is quantity)
df1_subset = df1[df1.quantity == input_quantity].copy()

I am wondering why the .copy() is needed here? Doesn't df1_subset = df1[df1.quantity == input_quantity] make df1_subset a copy and not a reference?

I saw in why should I make a copy of a data frame in pandas that a reference would be returned, but I think this is outdated?

24n8
  • 1,898
  • 1
  • 12
  • 25
  • 2
    This is a safety to avoid a `SettingwithCopyWarning`, using `df1.loc[df1.quantity == input_quantity]` should prevent the need for a copy (provided prior slicing was done properly). See [this Q/A](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas) for further details on the warning. – mozway Jan 11 '23 at 14:42
  • *Doesn't df[...] make a copy* No it's not guaranteed, and in fact it will try to give a view back if it can. Even with `.loc`, the need for copy won't go away to suppress the warning. If you will do modifications to the subset, and you're not having memory concerns (which is rare), you should always chain a copy at the end – Mustafa Aydın Jan 11 '23 at 14:51
  • Just try. Do without copy and modify some data on original dataframe, and check if it is modified also in the subset. Note: it may be more memory efficient not to do a copy, but just having the indices (think on wide df with a lot of columns). Possibly Pandas want the freedom to choose the method – Giacomo Catenazzi Jan 12 '23 at 09:10

0 Answers0