In this article on predicting values with linear regression, there's a cleaning step:
# To begin, transform train['FullDescription'] to lowercase using text.lower()
train['FullDescription'].str.lower()
# Then replace everything except letters and numbers with spaces.
# This will make it easier to split the text into words later.
train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)
This isn't actually assigning the changes to the dataframe, is it? But if I try something like this...
train['FullDescription'] = train['FullDescription'].str.lower()
train['FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)
Then I get a warning...
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
What's the right way to apply these transformations? Are they in fact already being applied? A print(train['FullDescription']) seems to show me they're not.
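A minimal sketch of why the print shows no change, assuming a small made-up DataFrame in place of the real data (the sample strings are hypothetical). The string methods return a new Series, so the column only changes when the result is assigned back:

```python
import pandas as pd

# Hypothetical toy data standing in for the real 'FullDescription' column.
original = ["Senior C++ Developer!", "Data-Analyst (London)"]
train = pd.DataFrame({"FullDescription": original})

# Without assignment, the lowercased Series is built and then discarded.
train["FullDescription"].str.lower()
unchanged = train["FullDescription"].tolist()

# Assigning the result back is what actually updates the column.
cleaned = train.copy()
cleaned["FullDescription"] = (
    cleaned["FullDescription"]
    .str.lower()
    .str.replace(r"[^a-z0-9]", " ", regex=True)
)
print(cleaned["FullDescription"].tolist())
```

Here `unchanged` still holds the original mixed-case strings, while `cleaned` holds the lowercased text with punctuation replaced by spaces.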
EDIT: @EdChum and @jezrael are very much onto something about missing code. When I'm actually trying to run this, my data needs to be split into test and train sets.
import pandas
from sklearn.model_selection import train_test_split
all_data = pandas.read_csv('salary.csv')
train, test = train_test_split(all_data, test_size=0.1)
That's what seems to be causing this warning. If I make the next lines
train = train.copy()
test = test.copy()
then everything is happy.
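For reference, a self-contained sketch of that split-then-copy sequence, assuming a small made-up DataFrame instead of salary.csv (the sample rows, the `value` column, and `random_state=0` are hypothetical). Whether the split hands back views or flagged copies, calling `.copy()` gives each frame its own data before the column is reassigned:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the contents of salary.csv.
all_data = pd.DataFrame({
    "FullDescription": [f"Job Title #{i}!" for i in range(10)],
    "value": list(range(10)),
})

train, test = train_test_split(all_data, test_size=0.1, random_state=0)

# Take explicit copies so later column assignments act on independent frames.
train = train.copy()
test = test.copy()

for frame in (train, test):
    frame["FullDescription"] = (
        frame["FullDescription"]
        .str.lower()
        .str.replace(r"[^a-z0-9]", " ", regex=True)
    )
```

After this runs, `train` and `test` hold the cleaned text and `all_data` keeps the original strings.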
You may be wondering if I shouldn't then just apply this step to all_data, which works, but then lower down in the code train['Body'].fillna('nan', inplace=True) still causes an error. So it seems indeed the problem is with train_test_split not creating copies.
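One possible alternative for the fillna line, sketched on made-up data (the sample values are hypothetical): assign the result back to the column instead of using inplace=True on a column selection. This is only a sketch of the assignment form, not a confirmed fix for the setup above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a missing value in 'Body'.
train = pd.DataFrame({"Body": ["some text", np.nan, "more text"]})

# fillna returns a new Series; assigning it back updates the column directly.
train["Body"] = train["Body"].fillna("nan")
print(train["Body"].tolist())
```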