This question may be a duplicate, but I could not fix the issue in my code even after reading the similar questions and their answers. I need to split my dataset into train and test sets. However, I seem to be doing something wrong when I add a new column holding the predicted cluster. The message that I get is:

/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until

There are a few questions about this message, but I am probably doing something wrong, because none of their answers has made it go away. The dataset is the following:

    Date        Link                     Value
0   03/15/2020  https://www.bbc.com      1
1   03/15/2020  https://www.netflix.com  4
2   03/15/2020  https://www.google.com   10
...

I have split the dataset into train and test sets as follows:

import pandas as pd  # needed for pd.DataFrame and pd.Series below
import sklearn
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords  # needed for stopwords.words below
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import string as st

train_data = df.Link.tolist()
df_train = pd.DataFrame(train_data, columns=['Review'])
X = df_train

X_train, X_test = train_test_split(X, test_size=0.4).copy()
X_test, X_val = train_test_split(X_test, test_size=0.5).copy()
print(X_train.isna().sum())
print(X_test.isna().sum())

stop_words = stopwords.words('english')

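def preprocessor(t):
    # Keep letters only and lowercase, so tokens match the lowercase stop words
    t = re.sub(r"[^a-zA-Z]", " ", t.lower())
    words = word_tokenize(t)
    # Drop stop words, then lemmatize what is left
    w_lemm = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return w_lemm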


vect = TfidfVectorizer(tokenizer=preprocessor)
vectorized_text = vect.fit_transform(X_train['Review'])
kmeans = KMeans(n_clusters=3).fit(vectorized_text)

The lines of code that trigger the warning are:

cl=kmeans.predict(vectorized_text)
X_train['Cluster']=pd.Series(cl, index=X_train.index)

I thought these two questions would help me with the code:

How to add k-means predicted clusters in a column to a dataframe in Python

How to deal with SettingWithCopyWarning in Pandas?

but something is still wrong in my code.

Could you please have a look at it and help me fix this issue before closing the question as a duplicate?

still_learning
  • Please, before deleting a question that has been closed as a duplicate and re-posting it verbatim, be sure to show exactly how the linked answers do not resolve your issue, instead of just stating "but something is still continuing to be wrong within my code"; the first answer shows clearly how to break the line giving you the error into two steps, i.e. `cl=kmeans.predict(vectorized_text)`, and then `X_train['Cluster']=pd.Series(cl, index=X_train.index)`. Did you actually try that? If yes, and you are still getting an error, please **modify** the code shown here accordingly. – desertnaut May 20 '20 at 23:04
  • There was an issue indeed in the first answer (it should be `pd.Series` instead of `Series`); please check and confirm that your issue is resolved using the (edited) answer there. – desertnaut May 20 '20 at 23:07
  • OK, then, as I said, please **modify** your code to show this (thus justifying that your question is not a duplicate). – desertnaut May 20 '20 at 23:10
  • This isn't an error - it's a warning. Did you go to the link suggested in the warning? Any time you assign data to a copy of a slice of a dataframe, you will get this message. It's not necessarily a bug in your code, which is why it's a warning not an error. – Michael Delgado May 20 '20 at 23:35
  • Yes, I did, but I have not understood what I should do to make it disappear or fix it – still_learning May 20 '20 at 23:35

1 Answer

IMHO, the problem is that train_test_split returns a list, so the copy() you call is the list's method, not pandas'. The dataframes inside the list are not copied, and assigning to them still triggers pandas' infamous copy warning.

So you only create a shallow copy of the list, not of its elements. In other words

X_train, X_test = train_test_split(X, test_size=0.4).copy()

is equivalent to:

train_test = train_test_split(X, test_size=0.4)
train_test_copy = train_test.copy()
X_train, X_test = train_test_copy[0], train_test_copy[1]
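
A minimal sketch of why the list-level copy does not help, using plain Python lists as a stand-in for the dataframes. The helper split_in_two is hypothetical and only mimics the fact that train_test_split hands back a list of parts:

```python
def split_in_two(rows):
    """Hypothetical stand-in for train_test_split: returns a list of two parts."""
    middle = len(rows) // 2
    return [rows[:middle], rows[middle:]]


parts = split_in_two(["a", "b", "c", "d"])
parts_copy = parts.copy()  # list.copy(): a new outer list, same elements inside

same_outer = parts_copy is parts          # the outer list is a different object
same_first = parts_copy[0] is parts[0]    # the first element is the same object
same_second = parts_copy[1] is parts[1]   # the second element is the same object
print(same_outer, same_first, same_second)  # False True True
```

The outer list is new, but both elements are the very same objects as before, so whatever state they carried before the copy is still there afterwards.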

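Since Python names are just references to objects, and the dataframes returned by train_test_split are produced by indexing X, pandas may still treat X_train and X_test as slices of X rather than as independent dataframes. Assigning a new column to such a dataframe is what raises the warning. If you want to copy the dataframes, you should explicitly call copy() on each of them: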

X_train, X_test = train_test_split(X, test_size=0.4)
X_train, X_test = X_train.copy(), X_test.copy()

or

X_train, X_test = [d.copy() for d in train_test_split(X, test_size=0.4)]

Then X_train and X_test are each a new dataframe with its own data, independent of X.


Update: I tested this code and got no warnings:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame(np.random.rand(100,3))
X_train, X_test = train_test_split(X, test_size=0.4)
X_train, X_test = X_train.copy(), X_test.copy()

X_train['abcd'] = 1
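
The same idea applied to the two-stage split from the question might look like the sketch below. The toy links, the random_state values, and the placeholder labels are assumptions made for illustration; in the real code, cl would come from kmeans.predict(vectorized_text):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the question's data (the links are illustrative only).
X = pd.DataFrame({"Review": [f"https://www.example{i}.com" for i in range(10)]})

# Copy each dataframe returned by the split, not the list that holds them.
X_train, X_rest = [d.copy() for d in train_test_split(X, test_size=0.4, random_state=0)]
X_test, X_val = [d.copy() for d in train_test_split(X_rest, test_size=0.5, random_state=0)]

# Placeholder labels standing in for kmeans.predict(vectorized_text).
cl = np.arange(len(X_train)) % 3

# X_train is an independent copy, so adding a column leaves X untouched.
X_train["Cluster"] = pd.Series(cl, index=X_train.index)
print(X_train.shape, X_test.shape, X_val.shape)
```

Copying right after each split keeps the rest of the pipeline unchanged: the assignment line from the question can stay exactly as it was.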
Quang Hoang