
I have a dataframe with multiple columns and I simply want to update a column with new values: df['Z'] = df['A'] % df['C']/2. However, I keep getting a SettingWithCopyWarning message, even when I use the .loc[] method or when I drop() the column and add it again.

:75: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The warning disappears with the .assign() method, but that is painfully slower. Here is a comparison:

df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

%timeit df['Z'] = df['A'] % df['C']/2
119 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.loc[:, 'Z'] = df['A'] % df['C']/2
118 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.assign(Z=df['A'] % df['C']/2)
857 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
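A likely reason for the gap, offered as a sketch rather than a profiling result: assign() returns a new DataFrame instead of writing into the existing one, so on pandas versions without Copy-on-Write the whole frame is copied first. The small frame and the names rng, out and z_before below are illustrative assumptions, not part of the original timing.

```python
import numpy as np
import pandas as pd

# Small stand-in for the 2,000,000 x 26 frame (sizes are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=list("ACZ"))
z_before = df["Z"].copy()

# assign() builds and returns a new DataFrame; df itself is left untouched.
out = df.assign(Z=df["A"] % df["C"] / 2)
print(out is df)                 # False
print(df["Z"].equals(z_before))  # True

# Plain column assignment writes the new column into the existing frame.
df["Z"] = df["A"] % df["C"] / 2
print(df["Z"].equals(out["Z"]))
```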

So what's the optimal way to update a column in the dataframe? Note that I don't have the option to create multiple copies of the same dataframe because of its huge size.

exan
  • Did you try your sample data? I did not get the SettingWithCopyWarning – BENY Aug 14 '20 at 01:01
  • https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – Quantum Dreamer Aug 14 '20 at 01:04
  • this is just a warning - it won't affect anything. FWIW, this changes across versions and I don't see it in version 1.01 – anon01 Aug 14 '20 at 01:11
  • I do get these warnings even when using .loc, you can stop them by calling `pd.set_option('mode.chained_assignment', None)` in main... pandas [explanation](https://pandas.pydata.org/docs/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing) does not convince me, it might be a bug – RichieV Aug 14 '20 at 01:11
  • @BEN_YO Actually this sample code is meant for comparing the three assignment operations only. – exan Aug 14 '20 at 01:25
  • I tried using `pd.DataFrame.eval` to no avail. It is as slow as assign. `df.eval('Z = A % C/2')` – Scott Boston Aug 14 '20 at 03:56

1 Answer


tl;dr - make a copy of the slice using .copy(), or suppress the warning with pd.set_option('mode.chained_assignment', None)

There are some great posts about SettingWithCopyWarning. First off, this is just a warning, not an error. Most of the time it is warning you about behavior you didn't intend anyway, or about something you don't really care about.

Now, let's avoid this warning. Given your data, I am first going to reproduce the warning on purpose.

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

# Executing df['Z'] = df['A'] % df['C']/2 on the full dataframe raises no warning.
df['Z'] = df['A'] % df['C']/2

# However, slice the dataframe (here dropping only the last row) and assign on the slice.
df_slice = df.loc[:1999998]
df_slice['Z'] = df_slice['A'] % df_slice['C']/2

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

In this case, the warning is letting you know that you are changing the original df object.

df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
df_slice = df.loc[:1999998]
df_slice['Z'] = df_slice['A'] % df_slice['C']/2
all(df.loc[:1999998, 'Z'] == df_slice['Z'])

This returns the above warning and True: modifying the slice did change the original df object.

Now, to avoid the warning and leave the original object unchanged, use .copy():

df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

df_slice = df.loc[:1999998].copy()
df_slice['Z'] = df_slice['A'] % df_slice['C']/2
all(df.loc[:1999998, 'Z'] == df_slice['Z'])

Returns no warning and False.

So, one way to keep the performance of the first and second methods is to call .copy() when you create the slice/view of the dataframe. You are correct that this takes extra memory, though. To limit it, overwrite your dataframe variable with the .copy() result rather than keeping both objects.
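A minimal sketch of that overwrite pattern, assuming a small stand-in frame: the same name df is rebound to the explicit copy of the slice, so no second variable keeps the original alive. Both frames still exist briefly while the copy is made. The warning check through Python's warnings module and the names rng, caught and copy_warnings are illustrative assumptions.

```python
import warnings

import numpy as np
import pandas as pd

# Small stand-in for the large frame in the question (sizes are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=list("ACZ"))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Rebind the same name to an explicit copy of the slice,
    # then assign the column on that independent object.
    df = df.loc[:998].copy()
    df["Z"] = df["A"] % df["C"] / 2

# Count any SettingWithCopy-style warnings that were recorded.
copy_warnings = [w for w in caught if "SettingWithCopy" in w.category.__name__]
print(len(copy_warnings))  # 0
```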

Or you can turn this warning off using:

pd.set_option('mode.chained_assignment', None)
df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

df_slice = df.loc[:1999998]
df_slice['Z'] = df_slice['A'] % df_slice['C']/2
all(df.loc[:1999998, 'Z'] == df_slice['Z'])

Returns no warning and True.
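pd.set_option changes the setting for the whole session. If that is too broad, one alternative is to silence only this message around the one assignment using Python's standard warnings module. This is a different technique from the pd.set_option call above, sketched under the assumption that the warning text matches the message quoted earlier. COPY_MSG and leaked are hypothetical names.

```python
import warnings

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=list("ACZ"))
df_slice = df.loc[:998]

# Regex for the start of the warning text quoted earlier. The leading \s*
# tolerates whitespace (such as a newline) before the message.
COPY_MSG = r"\s*A value is trying to be set on a copy of a slice"

with warnings.catch_warnings(record=True) as leaked:
    warnings.simplefilter("always")      # record anything not ignored below
    with warnings.catch_warnings():      # filters are restored on exit
        warnings.filterwarnings("ignore", message=COPY_MSG)
        df_slice["Z"] = df_slice["A"] % df_slice["C"] / 2

# List any copy-related messages that escaped the local filter.
print([str(w.message) for w in leaked if "set on a copy" in str(w.message)])
```

Outside the inner with block the filter is gone, so later assignments elsewhere in the program would still warn as usual.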

In short, pandas sometimes creates a new object for a slice of a dataframe and sometimes returns a view of the original dataframe. When it does which is understood by few and, as far as I could find, not very well documented.

A strong hint as to when this warning will appear is the _is_view attribute:

df_slice = df.loc[:1999998]
df_slice._is_view

Returns True, hence the SettingWithCopyWarning might appear.

df_slice = df.loc[:1999998].copy()
df_slice._is_view

Returns False.
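Since _is_view is a private attribute, a cross-check with public API may be useful: np.shares_memory reports whether two arrays overlap in memory. The sketch below assumes an all-float frame like the one above, and shares_column_memory is a hypothetical helper, not a pandas function. The result for the plain slice can differ between pandas versions, so no expected value is given for it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=list("ACZ"))


def shares_column_memory(a, b, col):
    """Hypothetical helper: True if column `col` of both frames overlaps in memory."""
    return bool(np.shares_memory(a[col].to_numpy(), b[col].to_numpy()))


df_slice = df.loc[:998]        # may be a view, depending on pandas version and dtypes
df_copy = df.loc[:998].copy()  # an explicit copy owns its own data

print(shares_column_memory(df, df_slice, "A"))  # version-dependent
print(shares_column_memory(df, df_copy, "A"))   # False
```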

Scott Boston
  • **_is_view** is a good trick and it does seem to work on my example dataframe. However, it doesn't seem to work every time. In my original dataframe (which I can't share), **_is_view** returns False but I still get the warning. – exan Aug 14 '20 at 14:29
  • in your example ```df_slice = df.loc[:1999998]```, I can see different data in ```df_slice``` and in ```df.loc[:1999998]```. Then why does the ```all(df.loc[:1999998, 'Z'] == df_slice['Z']) ``` comparison return all True? – exan Aug 14 '20 at 14:48
  • _is_view is only a hint. I could not find documentation on when pandas creates a view or a new object. If you don't use .copy(), then when you change df_slice you also change df, therefore all(df.loc...) will yield True. If you did use .copy(), then you have a separate object and all(df.loc..) will yield False. It is important that you recreate df in between each of these tests. – Scott Boston Aug 14 '20 at 16:45