Minimal Example:
Consider this dataframe temp:
>>> import numpy as np
>>> import pandas as pd
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> temp
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
Now I try to shuffle the columns one at a time in a for loop.
>>> for i in temp.columns:
...     np.random.shuffle(temp.loc[:,i])
...     print(temp)
...
A B C
0 8 2 3
1 3 3 4
2 9 4 5
3 6 5 6
4 4 6 7
5 10 7 8
6 7 8 9
7 1 9 10
8 2 10 11
9 5 11 12
A B C
0 8 7 3
1 3 9 4
2 9 8 5
3 6 10 6
4 4 4 7
5 10 11 8
6 7 5 9
7 1 3 10
8 2 2 11
9 5 6 12
A B C
0 8 7 6
1 3 9 8
2 9 8 4
3 6 10 10
4 4 4 7
5 10 11 11
6 7 5 5
7 1 3 3
8 2 2 12
9 5 6 9
This works perfectly.
Specific Example:
Now, if I want to get a part of this dataframe for training and testing purposes, I'll use the train_test_split function from sklearn.model_selection.
>>> from sklearn.model_selection import train_test_split
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> y = [i for i in range(16,26)]
>>> len(y)
10
>>> X_train,X_test,y_train,y_test = train_test_split(temp,y,test_size=0.2)
>>> X_train
A B C
2 3 4 5
6 7 8 9
8 9 10 11
0 1 2 3
7 8 9 10
3 4 5 6
1 2 3 4
9 10 11 12
Now we've obtained our X_train dataframe. To shuffle each of its columns:
>>> for i in X_train.columns:
...     np.random.shuffle(X_train.loc[:,i])
...     print(X_train)
...
Which, unfortunately, results in an error.
Error:
sys:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "mtrand.pyx", line 4852, in mtrand.RandomState.shuffle
File "mtrand.pyx", line 4855, in mtrand.RandomState.shuffle
File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\series.py", line 623, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2560, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 83, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 91, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 817, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 4
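A possible clue in the traceback, offered as an assumption rather than a confirmed diagnosis: the failing call is Series.__getitem__ with key 4, and X_train has no row labelled 4 (rows 4 and 5 went to the test set). The sketch below reproduces only that lookup on a hypothetical Series s that has the same index as X_train; it does not call np.random.shuffle itself.

```python
import pandas as pd

# Hypothetical Series with the same integer index as X_train above
# (labels 4 and 5 are missing because those rows went to the test set).
s = pd.Series([3, 7, 9, 1, 8, 4, 2, 10], index=[2, 6, 8, 0, 7, 3, 1, 9])

# With an integer index, s[4] is a label lookup, not a positional one.
try:
    s[4]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print("label 4 in index:", 4 in s.index)
print("s[4] raised KeyError:", lookup_failed)

# Positional access still works: the fifth element sits under label 7.
fifth = s.iloc[4]
print("s.iloc[4] =", fifth)
```

In temp the index is 0 to 9, so labels and positions coincide, which may be why the first loop never hit a missing key.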
Tracing the problem and its solution:
Searching for the SettingWithCopyWarning, I found this question, which had this line in its first answer:
However it could create a copy which updates a copy of data['amount'] which you would not see. Then you would be wondering why it is not updating.
But, if this was the case, then why did the code work for the first case?
It's also given in the answer that:
Pandas returns a copy of an object in almost all method calls. The inplace operations are a convience operation which work, but in general are not clear that data is being modified and could potentially work on copies.
So, instead of using np.random.shuffle, we can use np.random.permutation, as shown in this answer. So:
>>> for i in X_train.columns:
...     X_train.loc[:,i] = np.random.permutation(X_train.loc[:,i])
...     print(X_train)
...
But I get the SettingWithCopyWarning again, this time along with the shuffled output.
C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexing.py:621: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item_labels[indexer[info_axis]]] = value
A B C
2 10 4 5
6 9 8 9
8 2 10 11
0 8 2 3
7 1 9 10
3 3 5 6
1 4 3 4
9 7 11 12
A B C
2 10 5 5
6 9 11 9
8 2 4 11
0 8 9 3
7 1 3 10
3 3 8 6
1 4 10 4
9 7 2 12
A B C
2 10 5 10
6 9 11 5
8 2 4 11
0 8 9 3
7 1 3 4
3 3 8 6
1 4 10 12
9 7 2 9
This can be a workaround.
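A minimal sketch of that workaround applied to an explicit copy, on the assumption (mine, not confirmed by the quoted answers) that copying first is what keeps the assignment from targeting a slice of the original frame. The names subset, shuffled and rng are hypothetical; subset stands in for X_train, and the seed is there only to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Stand-in for X_train: a row subset of a larger frame, so the index has gaps.
temp = pd.DataFrame({"A": range(1, 11), "B": range(2, 12), "C": range(3, 13)})
subset = temp.iloc[[2, 6, 8, 0, 7, 3, 1, 9]]

# Work on an explicit copy so assignments do not target a slice of temp.
shuffled = subset.copy()

rng = np.random.default_rng(0)  # seeded only for repeatability
for col in shuffled.columns:
    # permutation returns a new array; assigning it back is positional,
    # so the row labels stay where they were and only the values move.
    shuffled[col] = rng.permutation(shuffled[col].to_numpy())

print(shuffled)
```

Each column keeps the same set of values and the same index; only the order of the values within the column changes, and temp itself is left untouched.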
Questions:
- Why does the code work for the first case, and not the second case, when I use train_test_split?
- Why do I still get the SettingWithCopyWarning when I'm not using the in-place shuffler np.random.shuffle?
Requests for Suggestions:
- Is there a better (easy to use/error free/faster) method to do column shuffling?