sklearn train_test_split returns some elements in both test/train

Question

I have a data-set X with 260 unique observations.

When running x_train,x_test,_,_=train_test_split(X,y,test_size=0.2) I would assume that [p for p in x_test if p in x_train] would be empty, but it is not. In fact, it turns out that only two observations in x_test are not in x_train.

Is that intended or...?

EDIT (posted the data I am using):

from sklearn.datasets import load_breast_cancer 
from sklearn.model_selection import train_test_split as split
import numpy as np

DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])

x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)

len([p for p in x_test if p in x_train]) #is not 0

EDIT 2.0: Showing that the test works

a=np.array([[1,2,3],[4,5,6]])
b=np.array([[1,2,3],[11,12,13]])

len([p for p in a if p in b]) #1

can you post a full reproducible example? the intersection of train and test should be empty. — amdex, Dec 01 '19 at 18:26
If I use the answer from https://stackoverflow.com/questions/38674027/find-the-row-indexes-of-several-values-in-a-numpy-array i.e `np.where((x_test==x_train[:,None]).all(-1))[1]` that returns `[]` (that the intersect is empty). But I do think that `[p for p in x_test if p in x_train]` should do the trick — CutePoison, Dec 01 '19 at 18:37
clearly `[p for p in x_test if p in x_train]` does not do the trick, because it gives different results than `np.where((x_test==x_train[:,None]).all(-1))[1]`. As to why this is, see here: https://stackoverflow.com/questions/39452843/in-operator-for-numpy-arrays — amdex, Dec 01 '19 at 19:03
Okay, but we do agree that if `x_test` has some values in `x_train` then `[p for p in x_test if p in x_train]` would be non-empty (and logically the other way round)? I mean that the way I test if the two sets intersection is empty is correct — CutePoison, Dec 01 '19 at 19:05
Right, I agree, and you would be correct if you were working with lists. I also agree that this is counter-intuitive. The test assumes that the `in` operator works like it does on, say, a list of tuples, but it doesn't for numpy arrays. As an example, consider `[p for p in [tuple(x) for x in x_test] if p in [tuple(x) for x in x_train]]`, which does what you want (and gives an empty intersection, as expected) – amdex, Dec 01 '19 at 19:06
I cannot see whether the problem is in train_test_split or somewhere else. The rows should be unique (`len(X)==len(np.unique(X,axis=0))`), so I really cannot see where the problem is – CutePoison, Dec 01 '19 at 19:15
the problem is definitely not in `train_test_split`, it is in your use of the `in` operator in `len([p for p in x_test if p in x_train])`. The `in` operator does not do what you want it to do in this case, because both `x_test` and `x_train` are arrays. — amdex, Dec 01 '19 at 19:16
If you see my edit 2.0, that should show that it works with np.arrays (or did I misunderstand you?) – CutePoison, Dec 01 '19 at 19:19
Try the same with: `a=np.array([[1,2,3],[4,5,6]]); b=np.array([[5,5,5]])`, it will still return `True`. The `in` operator does not do what you think it does. — amdex, Dec 01 '19 at 19:23
Aaaah! Now I see! Thank you so much - would you post that as an answer? — CutePoison, Dec 01 '19 at 19:25

score 1 · Accepted Answer · answered Dec 01 '19 at 19:32

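This is not a bug in sklearn's implementation of train_test_split, but a peculiarity of how the in operator works on numpy arrays. For arrays, the in operator first does an elementwise comparison (with broadcasting) between the two arrays and returns True if ANY single element matches. A row therefore counts as "contained" as soon as one of its values equals the value in the same column of some row of the other array.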

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5]])
a in b # True
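To make the behaviour concrete, here is a minimal sketch using the arrays from the comment thread. It assumes that, for NumPy arrays, `x in arr` behaves like `(arr == x).any()`; the variable names are only illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[5, 5, 5]])

# For each row, `row in b` gives the same answer as (b == row).any().
for row in a:
    assert (row in b) == bool((b == row).any())

# [4, 5, 6] shares only its middle value with [5, 5, 5],
# yet the list comprehension from the question reports it as a hit.
false_hits = [row for row in a if row in b]
print(len(false_hits))  # 1
```

So the comprehension counts rows that share at least one value in the same column, not rows that are identical.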

The correct way to test for this kind of overlap is using the equality operator and np.all and np.any. As a bonus, you also get the indices that overlap for free.

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5], [7, 8, 9]])
# `a in b` is not a usable test here: shapes (2, 3) and (3, 3) do not broadcast

z = np.any(np.all(a == b[:, None, :], -1))  # False

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [1, 2, 3], [7, 8, 9]])
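# again, `a in b` is not a usable test: shapes (2, 3) and (3, 3) do not broadcast

overlap = np.all(a == b[:, None, :], -1)  # shape (len(b), len(a))
z = np.any(overlap)  # True
indices = np.nonzero(overlap)  # (array([1]), array([0])): row 1 of b equals row 0 of a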
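Applied to a train/test split, the same broadcasting check can be wrapped in a small helper. This is only a sketch: `shared_row_indices` is a hypothetical name, and a toy array with a manual slice stands in for the real data and for train_test_split.

```python
import itertools
import numpy as np

def shared_row_indices(a, b):
    """Hypothetical helper: indices (into b, into a) of rows that are identical."""
    # Broadcast to shape (len(b), len(a), n_columns), then require a full-row match.
    overlap = np.all(a == b[:, None, :], axis=-1)
    return np.nonzero(overlap)

# 27 distinct rows built from the values 0, 1, 2 (toy stand-in for X).
X = np.array(list(itertools.product(range(3), repeat=3)))
x_train, x_test = X[:20], X[20:]  # disjoint by construction

naive_hits = len([p for p in x_test if p in x_train])
train_idx, test_idx = shared_row_indices(x_test, x_train)

print(naive_hits)      # 7: every test row shares a value with some train row
print(len(train_idx))  # 0: no test row is identical to a train row
```

The intermediate boolean array has len(b) * len(a) * n_columns entries, which is fine for a few hundred rows but may need chunking for large data.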

seralouk · Answer 2 · 2019-12-01T19:52:35.103

score -1

You need to check using the following:

from sklearn.datasets import load_breast_cancer 
from sklearn.model_selection import train_test_split as split
import numpy as np

DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])

x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)

len([p for p in x_test.tolist() if p in x_train.tolist()])  # 0

After converting with .tolist(), each row is a plain Python list, so the in operator compares whole rows for equality and works as intended.
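The list-based check scans every training row for each test row. A possible variant, along the lines of the tuple suggestion in the comments on the question, stores the training rows in a set of tuples first. `count_shared_rows` is a hypothetical helper name and the arrays are toy data.

```python
import numpy as np

def count_shared_rows(x_test, x_train):
    """Hypothetical helper: number of rows of x_test that also appear in x_train."""
    # Tuples are hashable, so each membership test is a set lookup on a whole row.
    train_rows = {tuple(row) for row in x_train.tolist()}
    return sum(tuple(row) in train_rows for row in x_test.tolist())

x_train = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
x_test = np.array([[5.0, 5.0, 5.0], [4.0, 5.0, 6.0]])

print(count_shared_rows(x_test, x_train))  # 1: only [4, 5, 6] is in both
```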

Reference: testing whether a Numpy array contains a given row

edited Dec 01 '19 at 19:52

answered Dec 01 '19 at 18:34


So if you use the breast_cancer data you get the same behaviour? I first thought that maybe the data set isn't unique but `len(X)==len(np.unique(X)) #True`, so I cannot see what can be the problem? – CutePoison Dec 01 '19 at 19:03
this is not an sklearn bug, so you should not report it on the issue tracker. – amdex Dec 01 '19 at 19:05
