Is this example data cleaning code updating the pandas dataframe?

Question

In this article on predicting values with linear regression, there's a cleaning step:

# For beginning, transform train['FullDescription'] to lowercase using text.lower()
train['FullDescription'].str.lower()

# Then replace everything except the letters and numbers in the spaces.
# it will facilitate the further division of the text into words.
train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)

This isn't actually assigning the changes to the dataframe, is it? But if I try something like this...

train['FullDescription'] = train['FullDescription'].str.lower()
train['FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)

Then I get a warning...

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

What's the right way to apply these transformations? Are they in fact already being applied? A print(train['FullDescription']) seems to show me they're not.
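A minimal sketch of the first point, assuming a hypothetical one-row frame in place of the real salary data: `.str.lower()` returns a new Series, so calling it without assigning the result leaves the column as it was.

```python
import pandas as pd

# Hypothetical one-row frame; the real salary data is not shown in the question.
train = pd.DataFrame({"FullDescription": ["Hello, World!"]})

# The result is computed and then discarded, so the column is untouched.
train["FullDescription"].str.lower()
before = train["FullDescription"].iloc[0]

# Assigning the result back is what actually updates the column.
train["FullDescription"] = train["FullDescription"].str.lower()
after = train["FullDescription"].iloc[0]

print(before)
print(after)
```

The same reasoning applies to `.replace(..., regex=True)`, which also returns a new object unless the result is assigned back.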

EDIT: @EdChum and @jezrael are very much onto something about missing code. When I'm actually trying to run this, my data needs to be split into test and train sets.

from sklearn.model_selection import train_test_split
all_data = pandas.read_csv('salary.csv')
train, test = train_test_split(all_data, test_size=0.1)

That split seems to be what causes the warning. If I add these as the next lines

train = train.copy()
test = test.copy()

then everything is happy.

You may be wondering whether I should just apply this step to all_data instead. That works, but lower down in the code train['Body'].fillna('nan', inplace=True) still causes an error. So the problem does seem to be that train_test_split is not creating copies.
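The whole workflow might be sketched as follows. To keep the example pandas-only, positional slices stand in for `train_test_split`; the sample rows, the `SalaryNormalized` column, and the `clean_text` helper are all hypothetical. `.str.replace` is used here for the regex step, as one alternative to the `Series.replace(..., regex=True)` call in the question.

```python
import pandas as pd

# Hypothetical stand-in for salary.csv; the real file is not shown in the question.
all_data = pd.DataFrame({
    "FullDescription": [
        "Senior C++ Developer!",
        "Data-Analyst (London)",
        "QA Engineer, Part-Time",
        "Office Manager",
    ],
    "SalaryNormalized": [60000, 45000, 30000, 28000],
})

# Positional slices stand in for train_test_split (an assumption of this sketch).
# .copy() makes each piece an independent DataFrame.
train = all_data.iloc[:3].copy()
test = all_data.iloc[3:].copy()


def clean_text(column):
    """Lowercase the text, then replace every non-alphanumeric character with a space."""
    return column.str.lower().str.replace("[^a-z0-9]", " ", regex=True)


# Assign the cleaned column back; the results are not applied in place.
train["FullDescription"] = clean_text(train["FullDescription"])
test["FullDescription"] = clean_text(test["FullDescription"])

print(train["FullDescription"].tolist())
print(all_data["FullDescription"].tolist())
```

Because `train` and `test` are explicit copies, the assignments change only those frames and `all_data` keeps its original text.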

What code comes before this? Only `train = pandas.read_csv('salary-train.csv')`? — jezrael, Aug 08 '19 at 10:34
That is the correct way, that warning will appear if you filtered the original df, you'll need to post your full code in order for others to try to explain/reproduce your issue — EdChum, Aug 08 '19 at 10:34
Check [this](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas) — jezrael, Aug 08 '19 at 10:35
@EdChum this is not my code, please see the first line of the question for a link to the article which has this code in it. — Martin Burch, Aug 08 '19 at 10:39
@jezrael is there a specific answer in that question you can recommend for this situation, updating the dataframe with a cleaned column? There's a whole lot going on in that question that makes it impenetrable. — Martin Burch, Aug 08 '19 at 10:46
Check answer of coldspeed - [link](https://stackoverflow.com/a/53954986/2901002) — jezrael, Aug 08 '19 at 10:47
@jezrael that answer is all about filtering, but I'm not doing any filtering. All the values are being modified. So I don't see how to use `loc` here. — Martin Burch, Aug 08 '19 at 10:50
@MartinBurch - Is it possible to see all of your code that comes before the code you posted (`train['FullDescription'] = train['FullDescription'].str.lower()` and `train['FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)`)? — jezrael, Aug 08 '19 at 10:51
The linked article would never update the df as written, you'd have to assign back which is what you showed. However, that warning is only raised when you filter or take a slice of the original df, without your code that reproduces that then I can't comment any further — EdChum, Aug 08 '19 at 11:02
Thank you very much for your help. You're both right, more code was involved. I've edited my question to add all this at the bottom. — Martin Burch, Aug 08 '19 at 11:09
This line `train, test = train_test_split(all_data, test_size=0.1)` is producing a filtered view on your original df, what you should do is apply the filtering to the original df first, and then split: `all_data['FullDescription'] = all_data['FullDescription'].str.lower()` etc.. — EdChum, Aug 08 '19 at 12:13

IMCoins · Answer 1 · 2019-08-08T10:55:53.687

0

The right way to apply these transformations would be...

df.loc[:, 'FullDescription'] = ...

More information about this is available here, at the very bottom of a page from the pandas documentation. Quoting...

def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo

You can also find additional reasons to use .loc here. Long story short: explicit is better than implicit. While df['some_column'] is not immediately clear about the intent, using df.loc[:, 'some_column'] is.
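One way to remove the ambiguity in the quoted function, offered as a sketch rather than as the documentation's own fix, is to take an explicit copy before assigning. The `value` parameter and the sample frame are assumptions added so that the example runs on its own.

```python
import pandas as pd


def do_something(df, value):
    # An explicit copy: foo is now an independent DataFrame, not a possible view of df.
    foo = df[["bar", "baz"]].copy()
    # Assigning through .loc states the intent to set a whole column on foo.
    foo.loc[:, "quux"] = value
    return foo


# Hypothetical input frame.
df = pd.DataFrame({"bar": [1, 2], "baz": [3, 4]})
result = do_something(df, 0)

print(list(df.columns))
print(list(result.columns))
```

With the copy in place, the new column lands on `result` only and `df` keeps its original two columns.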

I don't really know how to explain it in a simple way, but if you have further questions or if you think I could make my answer more explicit/eloquent, tell me. :)

edited Aug 08 '19 at 10:55

answered Aug 08 '19 at 10:32

IMCoins


Can you show code for `df['some_column']` with `SettingWithCopyWarning` vs `df.loc[:, 'some_column']` no warning? – jezrael Aug 08 '19 at 11:00
@jezrael I cannot reproduce, so I must be wrong somewhere but cannot figure out where – IMCoins Aug 08 '19 at 11:16
ya, reason is something else, check last OP edit. – jezrael Aug 08 '19 at 11:21
