I'm trying to figure out why the function below raises the dreaded SettingWithCopyWarning. The function is meant to modify the dataframe df in place.

def remove_outliers_by_group(df, cols):
    """
    Removes outliers based on median and median deviation computed using cols
    :param df: The dataframe reference
    :param cols: The columns to compute the median and median dev of
    :return:
    """
    flattened = df[cols].to_numpy().reshape(-1)  # as_matrix() was removed in pandas 1.0
    median = np.nanmedian(flattened)
    median_dev = np.nanmedian(np.abs(flattened - median))  # median absolute deviation
    for col in cols:
        df[col] = df[col].apply(lambda x: np.nan if get_absolute_median_z_score(x, median, median_dev) >= 2 else x)

The offending line is df[col] = df[col].apply(lambda x: np.nan if get_absolute_median_z_score(x, median, median_dev) >= 2 else x), according to this warning:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[col] = df[col].apply(lambda x: np.nan if get_absolute_median_z_score(x, median, median_dev) >= 2 else x)

What I don't understand is that I see this pattern all over the place, using something like df['a'] = df['a'].apply(lambda x: ...), so I can't imagine all of them are doing it wrong.

Am I doing it wrong? What is the best way to do this? I want to modify the original dataframe.

Thanks for your help.

coolboyjules
  • It is not due to the apply method but to the fact that you reassign a column of your dataframe. You can use `copy()` or simply disable the warning. – Thomas Grsp Aug 10 '17 at 14:05
  • So am I modifying the original dataframe in that line? That is what I want. Or am I creating a new dataframe and not modifying the passed `df` (which I don't want)? – coolboyjules Aug 10 '17 at 14:09
  • You are in fact modifying the original dataframe. I'll give you more insight in an answer. – Thomas Grsp Aug 10 '17 at 14:24

2 Answers

Check whether df was created from another data frame, for example by selecting some of its rows or columns. If it was, create it with an explicit copy:

df = df_test.copy()

This makes sure df is a copy and not a view.
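As a minimal sketch of that advice (the column names and the threshold of 10 are made up for illustration, and `df_test` stands in for whatever frame `df` was derived from), the copy is taken at the point where the subset is created, and the column reassignment then happens on an independent frame:

```python
import numpy as np
import pandas as pd

# Hypothetical source frame; the names and values are illustrative only.
df_test = pd.DataFrame({"a": [1.0, 2.0, 50.0], "b": [3.0, 4.0, 5.0], "c": [0, 1, 2]})

# Take an explicit copy when deriving the subset, so pandas knows
# df is an independent object and not a view of df_test.
df = df_test[["a", "b"]].copy()

# Reassigning a column now operates on the copy only.
df["a"] = df["a"].apply(lambda x: np.nan if x > 10 else x)

print(df["a"].isna().tolist())   # [False, False, True]
print(df_test["a"].tolist())     # [1.0, 2.0, 50.0], the source frame is untouched
```

Note that the copy is independent by design: changes made to df do not propagate back to df_test, so this fits when you want to work on the subset itself.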

This video explains the warning in more detail:

https://www.youtube.com/watch?v=4R4WsDJ-KVc

wisbucky
  • Thanks, that is actually right; I solved my warning with a copy. In my case I had `df = df_original[['col1', 'col2']]`, and adding `.copy()` there fixed it. After that, `df['col1'] = df['col1'].apply(lambda x: x)` no longer generates the warning. – steco Jun 15 '18 at 13:33

The problem is due to the reassignment, not to the fact that you use apply.

SettingWithCopyWarning is a warning that chained-indexing has been detected in an assignment. It does not necessarily mean anything has gone wrong.

To avoid the warning, use .loc as the message advises:

df.loc[:, col] = df[col].apply(...)
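A small self-contained sketch of that pattern inside a function follows. The function name `blank_out_large_values`, the `limit` parameter, and the sample frame are hypothetical; the simple threshold stands in for the asker's `get_absolute_median_z_score` check, which is not shown in the question.

```python
import numpy as np
import pandas as pd

def blank_out_large_values(df, cols, limit):
    """Replace values whose magnitude reaches `limit` with NaN, in the passed frame."""
    for col in cols:
        # Assign through .loc so the write targets df itself.
        df.loc[:, col] = df[col].apply(lambda x: np.nan if abs(x) >= limit else x)

# Illustrative data only.
frame = pd.DataFrame({"a": [1.0, 2.0, 50.0], "b": [3.0, -40.0, 5.0]})
blank_out_large_values(frame, ["a", "b"], limit=10)

print(frame["a"].isna().tolist())  # [False, False, True]
print(frame["b"].isna().tolist())  # [False, True, False]
```

The function returns nothing; the caller's frame is changed because the assignment is made on the same object that was passed in.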

Thomas Grsp
  • My knowledge ends here; the pandas docs on copies may help. If you want to disable the warning, you can use `pd.options.mode.chained_assignment = None` – Thomas Grsp Aug 10 '17 at 14:32
  • @coolboyjules Sometimes you can get the warning even on a line that uses `loc` (as here) because the DataFrame you are working with (`df`) is already ambiguously a copy or a view before it goes into your function, so the line you'd need to change is somewhere in the code above (usually by adding a `.copy()` to some other operation). It's annoying, but there it is. – Ajean Aug 10 '17 at 15:59
  • This answer didn't solve my issue. Instead, I found another answer (https://stackoverflow.com/a/60885847/8046546) that resolves it by calling .reset_index(drop=True) on the dataframe first. – Mapotofu Jan 19 '21 at 15:47