I have a dataset X with 260 unique observations.
When running
x_train, x_test, _, _ = train_test_split(X, y, test_size=0.2)
I would assume that
[p for p in x_test if p in x_train]
would be empty, but it is not. In fact, it turns out that only two observations in x_test are not in x_train.
Is that intended or...?
EDIT (posted the data I am using):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np
DATA = load_breast_cancer()
X = DATA.data
# Flip the labels: original class 0 becomes 1, original class 1 becomes 0
y = np.array([1 if p == 0 else 0 for p in DATA.target])
x_train, x_test, y_train, y_test = split(X, y, test_size=0.2, stratify=y, random_state=42)
len([p for p in x_test if p in x_train])  # is not 0
EDIT 2.0: Showing that the test works
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 2, 3], [11, 12, 13]])
len([p for p in a if p in b])  # 1
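One caveat with the small check above: as far as I understand, `p in b` on NumPy arrays is evaluated like `(b == p).any()`, so a single matching element anywhere is enough for it to be True, not a full matching row. That may be why the toy example looks right while the real data reports so many "overlaps". Below is a minimal sketch of a stricter row-wise comparison. The helper `count_shared_rows` and the array `c` are hypothetical names made up for this illustration; they are not part of scikit-learn or NumPy.

```python
import numpy as np


def count_shared_rows(test, train):
    """Count rows of `test` that appear as complete rows in `train`."""
    # Tuples are hashable, so full rows can be compared via set membership
    train_rows = {tuple(row) for row in train}
    return sum(tuple(row) in train_rows for row in test)


a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 2, 3], [11, 12, 13]])
# Hypothetical array: shares single values with `a`, but no complete row
c = np.array([[1, 0, 0], [0, 5, 0]])

# Element-wise membership: one matching value per row is enough
elementwise = len([p for p in a if p in c])

# Row-wise membership: the whole row has to match
rowwise = count_shared_rows(a, c)

print(elementwise, rowwise, count_shared_rows(a, b))
```

If the row-wise count on x_test and x_train comes out as 0, the split itself is presumably fine and only the membership test was misleading.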