I am trying to understand pandas' SettingWithCopyWarning: what exactly triggers it and how to avoid it. I want to take a selection of columns from a data frame and then work with that selection. I need to fill missing values and replace all values larger than 1 with 1.

I understand that sub_df=df[['col1', 'col2', 'col3']] produces a copy, and that seems to be what I want. Could someone explain why the copy warning is triggered here, whether it's a problem, and how I should avoid it?

I have read a lot about chained assignment in this context. Is that what I am doing here?

import pandas as pd

data = {'col1': [25, 0, 100, None],
        'col2': [50, 0, 0, None],
        'col3': [None, None, None, 100],
        'col4': [20, 20, 20, 20],
        'col5': [1, 1, 2, 3]}
df = pd.DataFrame(data)
sub_df = df[['col1', 'col2', 'col3']]
sub_df.fillna(0, inplace=True)
sub_df[df > 1] = 1  # produces the copy warning
sub_df

What really confuses me is why this warning is not triggered if I am not using a new name for my subset of columns as below:

import pandas as pd

data = {'col1': [25, 0, 100, None],
        'col2': [50, 0, 0, None],
        'col3': [None, None, None, 100],
        'col4': [20, 20, 20, 20],
        'col5': [1, 1, 2, 3]}
df = pd.DataFrame(data)
df = df[['col1', 'col2', 'col3']]
df.fillna(0, inplace=True)
df[df > 1] = 1  # does not produce the copy warning
df

Thanks!

Eva
  • The two code snippets are semantically different: in the first one it's ambiguous whether you want to operate on a copy or a view of the original df; in the second one you overwrite the original df with a subset of itself – EdChum May 25 '16 at 11:19
  • Thanks. So is the solution to make it clear to pandas that I do want to create a copy? How exactly would I do that? – Eva May 25 '16 at 11:52
  • It depends on what you want to do. If your intention is to operate on a copy, then do `sub_df=df[['col1', 'col2', 'col3']].copy()`. If your intention is to operate on a view, you'd be better off defining a list of the cols and using the new [`indexers`](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing), so `df[col_list].fillna(0)` and then `df.loc[df > 1, col_list] = 1` – EdChum May 25 '16 at 12:03
  • thanks! that's clear now. – Eva May 25 '16 at 12:08

1 Answer

Your two code snippets are semantically different. In the first, it's ambiguous whether you want to operate on a view or a copy of the original df; in the second, you overwrite df with a subset of itself, so there is no ambiguity.

If you want to operate on a copy then do this:

sub_df=df[['col1', 'col2', 'col3']].copy()
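As a minimal sketch of the copy-based route on the data from the question: the assignment-style `fillna` and the `sub_df > 1` mask are choices made for this sketch, not something prescribed above.

```python
import pandas as pd

data = {'col1': [25, 0, 100, None],
        'col2': [50, 0, 0, None],
        'col3': [None, None, None, 100],
        'col4': [20, 20, 20, 20],
        'col5': [1, 1, 2, 3]}
df = pd.DataFrame(data)

# Explicit copy: sub_df owns its data and is independent of df
sub_df = df[['col1', 'col2', 'col3']].copy()

# Fill missing values (assignment form, a choice made for this sketch)
sub_df = sub_df.fillna(0)

# Cap every value larger than 1 at 1, using a mask built from sub_df itself
sub_df[sub_df > 1] = 1
```

Because `sub_df` is an explicit copy, `df` keeps its original values, including the missing ones.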

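If you want to operate on the original df rather than on a copy, then I suggest defining a list of cols and assigning the results back to those columns, like the following:

col_list = ['col1', 'col2', 'col3']
df[col_list] = df[col_list].fillna(0)  # fillna returns a new object, so assign it back

and then

df[col_list] = df[col_list].mask(df[col_list] > 1, 1)  # replace values > 1 with 1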
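For the in-place route, a hedged end-to-end sketch on the question's data could look like this. It assumes the goal is to change `df` itself, and it uses `DataFrame.mask` for the capping step, which is this sketch's choice.

```python
import pandas as pd

data = {'col1': [25, 0, 100, None],
        'col2': [50, 0, 0, None],
        'col3': [None, None, None, 100],
        'col4': [20, 20, 20, 20],
        'col5': [1, 1, 2, 3]}
df = pd.DataFrame(data)

# Columns to clean; col4 and col5 are left untouched
col_list = ['col1', 'col2', 'col3']

# Fill missing values and write the result back into df
df[col_list] = df[col_list].fillna(0)

# Replace values larger than 1 with 1, again writing back into df
df[col_list] = df[col_list].mask(df[col_list] > 1, 1)
```

Every write here goes through `df` directly, so there is no intermediate subset whose copy-or-view status could be ambiguous.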
MaxU - stand with Ukraine
EdChum
  • Will that first line not do two copies? – Konstantin Aug 22 '17 at 15:20
  • @Konstantin sorry I don't understand, the first line creates a second df which is a sub-section of the orig df but it will be a distinct copy, not a view – EdChum Aug 22 '17 at 15:21
  • I was just wondering whether df[['col1', 'col2', 'col3']] already creates a copy, so that adding a .copy() to it would produce a second copy. – Konstantin Aug 23 '17 at 10:06
  • @Konstantin it depends: it may return a copy or a view, and a warning will be raised if you try to modify the result. This is where the confusion arises, so you need to be explicit about your intentions – EdChum Aug 23 '17 at 10:08