
Minimal Example:
Consider this dataframe temp:

>>> import numpy as np
>>> import pandas as pd
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> temp
    A   B   C
0   1   2   3
1   2   3   4
2   3   4   5
3   4   5   6
4   5   6   7
5   6   7   8
6   7   8   9
7   8   9  10
8   9  10  11
9  10  11  12

Now I shuffle the columns one at a time in a for loop:

>>> for i in temp.columns:
...     np.random.shuffle(temp.loc[:,i])
...     print(temp)
...
    A   B   C
0   8   2   3
1   3   3   4
2   9   4   5
3   6   5   6
4   4   6   7
5  10   7   8
6   7   8   9
7   1   9  10
8   2  10  11
9   5  11  12
    A   B   C
0   8   7   3
1   3   9   4
2   9   8   5
3   6  10   6
4   4   4   7
5  10  11   8
6   7   5   9
7   1   3  10
8   2   2  11
9   5   6  12
    A   B   C
0   8   7   6
1   3   9   8
2   9   8   4
3   6  10  10
4   4   4   7
5  10  11  11
6   7   5   5
7   1   3   3
8   2   2  12
9   5   6   9

This works perfectly.

Specific Example:

Now, if I want to get a part of this dataframe, for training and testing purposes, then I'll use the train_test_split function from sklearn.model_selection.

>>> from sklearn.model_selection import train_test_split
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> y = [i for i in range(16,26)]
>>> len(y)
10
>>> X_train,X_test,y_train,y_test = train_test_split(temp,y,test_size=0.2)
>>> X_train
    A   B   C
2   3   4   5
6   7   8   9
8   9  10  11
0   1   2   3
7   8   9  10
3   4   5   6
1   2   3   4
9  10  11  12

Now we have our X_train dataframe. To shuffle each of its columns:

>>> for i in X_train.columns:
...     np.random.shuffle(X_train.loc[:,i])
...     print(X_train)
...

Unfortunately, this results in an error.

Error:

sys:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "mtrand.pyx", line 4852, in mtrand.RandomState.shuffle
  File "mtrand.pyx", line 4855, in mtrand.RandomState.shuffle
  File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\series.py", line 623, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2560, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas\_libs\index.pyx", line 83, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 91, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 817, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 4

Tracing the problem and its solution:

While looking into the SettingWithCopyWarning, I found this question, which had this line under its first answer:

However it could create a copy which updates a copy of data['amount'] which you would not see. Then you would be wondering why it is not updating.

But, if this was the case, then why did the code work for the first case?

It's also given in the answer that:

Pandas returns a copy of an object in almost all method calls. The inplace operations are a convience operation which work, but in general are not clear that data is being modified and could potentially work on copies.

So, instead of using np.random.shuffle we can use np.random.permutation, as shown in this answer. So:

>>> for i in X_train.columns:
...     X_train.loc[:,i] = np.random.permutation(X_train.loc[:,i])
...     print(X_train)
...

But I get the SettingWithCopyWarning again, along with the shuffled result:

C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexing.py:621: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value
    A   B   C
2  10   4   5
6   9   8   9
8   2  10  11
0   8   2   3
7   1   9  10
3   3   5   6
1   4   3   4
9   7  11  12
    A   B   C
2  10   5   5
6   9  11   9
8   2   4  11
0   8   9   3
7   1   3  10
3   3   8   6
1   4  10   4
9   7   2  12
    A   B   C
2  10   5  10
6   9  11   5
8   2   4  11
0   8   9   3
7   1   3   4
3   3   8   6
1   4  10  12
9   7   2   9

This can be a workaround.


Questions:

  1. Why does the code work for the first case, and not the second case, when I use train_test_split?
  2. Why do I still get the SettingWithCopyWarning when I'm not using the inplace shuffler np.random.shuffle?

Requests for Suggestions:

  1. Is there a better (easy to use/error free/faster) method to do column shuffling?
Mooncrater

2 Answers


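1. Why does the code work for the first case, and not the second case, when I use train_test_split?

Because train_test_split shuffles the rows before splitting, the index of X_train is no longer a contiguous range starting at 0 but an arbitrary subset of the original labels. np.random.shuffle swaps elements using integer subscripts such as x[i], and on a Series with an integer index those subscripts are looked up as labels, not positions. Any label that ended up in the test set (4 in your traceback) is missing from X_train, hence the KeyError.

You can see this by inspecting the index of temp and X_train: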

X_train.index
Int64Index([6, 8, 9, 5, 0, 2, 3, 4], dtype='int64')

temp.index
RangeIndex(start=0, stop=10, step=1)
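
The difference matters because an integer subscript on a Series with an integer index is treated as a label. A minimal sketch of this, using a made-up three-row Series `s` rather than the data above:

```python
import pandas as pd

# Hypothetical Series whose index looks like one left over from a shuffled split.
s = pd.Series([10, 20, 30], index=[2, 6, 8])

# s[0] asks for the *label* 0, which does not exist in this index.
try:
    s[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Positional access goes through .iloc or the underlying NumPy array.
first_by_position = s.iloc[0]
first_from_array = s.to_numpy()[0]
```

With a default RangeIndex the labels 0, 1, 2, ... happen to equal the positions, which is why the same subscripts succeed on temp.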

In the first case, the labels 0 to 9 coincide with the positions, so the column can safely be treated as an array; in the second case it cannot. If you change the code in the second case to

for i in X_train.columns:
    np.random.shuffle(X_train.loc[:,i].values)
    print(X_train)  

this will not cause an error.

Note that the loop you presented shuffles each column independently, so the values in a row will no longer belong to the same original data point.

2. Why do I still get the SettingWithCopyWarning when I'm not using the inplace shuffler np.random.shuffle?

I don't get the warning when using the latest version of pandas (0.22.0).
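
On versions that do emit it, the warning most likely appears because X_train was produced by indexing into temp, so pandas flags it as a possible copy of a slice. One way around this, sketched below, is to work on an explicit copy and assign permuted NumPy arrays column by column. The helper name `shuffle_each_column` and the `seed` parameter are illustrative and not from the question.

```python
import numpy as np
import pandas as pd


def shuffle_each_column(df, seed=None):
    """Return a copy of df with every column permuted independently."""
    rng = np.random.default_rng(seed)
    out = df.copy()  # explicit copy, so the assignments below target a standalone frame
    for col in out.columns:
        # permutation() returns a shuffled copy of the values; assigning an
        # ndarray is positional, so the index labels are never consulted.
        out[col] = rng.permutation(out[col].to_numpy())
    return out
```

Called as `X_train = shuffle_each_column(X_train)`, this keeps the row index of X_train unchanged and leaves the original frame untouched.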

Requests for Suggestions:

  1. Is there a better (easy to use/error free/faster) method to do column shuffling?

I suggest using sample with axis=1, which shuffles the order of the columns. The number of samples should be the number of columns, i.e. X_train.shape[1]:

X_train = X_train.sample(X_train.shape[1],axis=1)

In []: X_train.sample(X_train.shape[1],axis=1)
Out[]: 
    B   A   C
6   8   7   9
9  11  10  12
8  10   9  11
4   6   5   7
5   7   6   8
0   2   1   3
2   4   3   5
3   5   4   6
sgDysregulation
  • +1 Nice answer! But I think you mistook my request for suggestion. I didn't want to shuffle columns themselves. Instead, I wanted to shuffle the values of a column, with the other columns' values remaining the same for each row. – Mooncrater Feb 16 '18 at 17:09
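  • I see. Then you can permute each column separately, e.g. `X_train = X_train.apply(lambda col: pd.Series(np.random.permutation(col), index=col.index))` (a bare `col.sample(len(col))` inside apply gets re-aligned on the index, which undoes the shuffle). But like I said, this will get the data points mixed up. – sgDysregulation Feb 16 '18 at 17:16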

I also ran into this problem with train_test_split. I used this instead:

np.random.shuffle(x.iloc[:, i].values)

Not sure why it works, but it seems to fix the problem.
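
A likely explanation: `.values` hands np.random.shuffle a plain NumPy array, which it shuffles by position without ever consulting the pandas index. Whether the shuffle then shows up in the DataFrame depends on `.values` returning a view rather than a copy, which pandas does not guarantee (and newer versions with copy-on-write may return a read-only array). The sketch below only demonstrates the positional, in-place behaviour on a standalone array:

```python
import numpy as np

arr = np.array([3, 4, 5, 6, 7])
original = arr.copy()

# shuffle() reorders the array in place and returns None; no labels are involved.
result = np.random.shuffle(arr)
```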