
I have a column of strings. I'm trying to count the number of tokens in each string and store those counts in a new column of the same dataframe.

 data['tokens'] = data['query'].str.split().apply(len)

I get a SettingWithCopyWarning and I'm not sure how to fix it. I understand I need to use .loc[row_indexer,col_indexer] = value, but I don't see how that would apply here.

John Constantine
  • This is usually because your `data` is a part of a bigger dataframe. Adding `data = data.copy()` right after you create `data` should help. – Quang Hoang Jun 23 '20 at 19:50
  • Here's a [thread](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas?rq=1) that talks about why this might crop up. If you know that you aren't running into the pitfalls outlined in the post by doing the operation you're performing, you can turn off the warning with: `pd.options.mode.chained_assignment = None` – ncasale Jun 23 '20 at 20:02
  • You never want to turn off the `SettingWithCopyWarning`! It's really important to know whether you're modifying the thing you meant to modify or not. – Ray Johns Jun 23 '20 at 20:13
  • @QuangHoang I still have the issue after doing that. – John Constantine Jun 23 '20 at 20:14
  • @JohnConstantine, see my answer below. I think something is going on with how you constructed your DataFrame. I made a DataFrame of strings and tried exactly this method of setting a tokens column and it worked. – Ray Johns Jun 23 '20 at 20:31

1 Answer


A SettingWithCopyWarning happens when you assign to an object that pandas believes was derived from a slice of another DataFrame, so it cannot tell whether you are trying to modify the underlying object or only the derived one.

To fix it, you need to understand the difference between a copy and a view. A copy is an entirely new object with its own data, while a view shares its data with the original. When you index into a DataFrame, like:

data['query'].str.split().apply(len)

or

data['tokens']

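you get back a new object (a Series in both of these cases) rather than the original DataFrame. The first expression computes a brand-new Series, so modifying it won't change the original data object. Whether a plain selection such as data['tokens'] is a view or a copy depends on pandas internals. You can check with the private _is_view attribute, which returns a boolean value (because it is private, it may change between pandas versions).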

data['tokens']._is_view

On the other hand, when you assign through the .at, .loc, or .iloc indexers, as in data.loc[row_indexer, col_indexer] = value, the selection and the assignment happen in a single call on the original DataFrame. That means you're subsetting it according to some criteria and manipulating the original object itself.
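As a rough illustration of single-step assignment with .loc, here is a minimal sketch. The frame `df`, its contents, and the `mask` variable are made up for the example; only the `query` and `tokens` column names come from the question.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the question.
df = pd.DataFrame({"query": ["a b", "c d e", "f"]})
df["tokens"] = 0

# Rows whose query contains at least one space.
mask = df["query"].str.contains(" ")

# Row and column selection happen in ONE .loc call, so the write lands in
# `df` itself. The chained form df[mask]["tokens"] = ... would instead write
# to a temporary object and is the kind of code the warning is about.
df.loc[mask, "tokens"] = df.loc[mask, "query"].str.split().str.len()

print(df)
```

Rows that do not match the mask keep their initial value of 0.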

Pandas raises the SettingWithCopyWarning when it looks like you are modifying a copy but probably mean to modify the original. To avoid this, you can call .copy() explicitly at the point where you create the subset, or you can use .loc to specify the rows and columns you want to modify in data (or both).
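A minimal sketch of the explicit .copy() route, assuming `data` was produced by filtering a larger frame. The parent frame `raw` and its `lang` column are hypothetical, since the question doesn't show how `data` was built.

```python
import pandas as pd

# Hypothetical parent frame; only the `query` column comes from the question.
raw = pd.DataFrame({
    "query": ["how to split strings", "pandas copy", "hello", "a b c"],
    "lang": ["en", "en", "fr", "en"],
})

# Filtering returns a subset of `raw`. Taking an explicit copy here tells
# pandas that `data` is an independent frame from this point on.
data = raw[raw["lang"] == "en"].copy()

# Adding a column to the independent copy is unambiguous, so no warning.
data["tokens"] = data["query"].str.split().apply(len)

print(data)
```

With this approach `raw` is left untouched. If the new column is supposed to end up in the parent frame as well, assign it there directly instead.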

Since it depends a lot on what transformations you've done to your DataFrame already and how it is set up, it's hard to say exactly where and how you can fix it without seeing more of your code. There's unfortunately no one-size-fits-all answer. If you can post more of your code, I'm happy to help you debug it.

One thing you might try is creating an intermediate lengths object explicitly, in case that is the problem. So your code would look like:

lengths = data['query'].str.split().apply(len).copy()
data['tokens'] = lengths
Ray Johns