I am exploring PCA in Scikit-learn (0.20 on Python 3), using Pandas to structure my data. When I apply a train/test split (and only then), my input labels no longer seem to match up with the PCA output.
import pandas
import sklearn.datasets
from matplotlib import pyplot
import seaborn
def load_bc_as_dataframe():
    data = sklearn.datasets.load_breast_cancer()
    df = pandas.DataFrame(data.data, columns=data.feature_names)
    df['diagnosis'] = pandas.Series(data.target_names[data.target])
    return data.feature_names.tolist(), df
feature_names, bc_data = load_bc_as_dataframe()
from sklearn.model_selection import train_test_split
# bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train = bc_data
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
bc_pca_raw = pca.fit_transform(bc_train[feature_names])
bc_pca = pandas.DataFrame(bc_pca_raw, columns=('PCA 1', 'PCA 2'))
bc_pca['diagnosis'] = bc_train['diagnosis']
seaborn.scatterplot(
    data=bc_pca,
    x='PCA 1',
    y='PCA 2',
    hue='diagnosis',
    style='diagnosis'
)
pyplot.show()
This looks reasonable, and that's borne out by accurate classification results. If I replace the line bc_train = bc_data with a train_test_split() call (even with test_size=0), my labels no longer seem to correspond to the original ones.
I realise that train_test_split() is shuffling my data (which I want it to, in general), but I don't see why that would be a problem, since the PCA and the label assignment both use the same shuffled data. PCA's transformation is just a projection, and while it obviously doesn't retain the same features (columns), it shouldn't change which label goes with which row.
How can I correctly relabel my PCA output?
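
A minimal sketch of what may be going on, assuming the cause is pandas index alignment rather than anything PCA does. The toy frame, the hand-picked row order standing in for the shuffle, and the column names `misaligned` and `by_position` are all hypothetical. It needs only NumPy and pandas, so the PCA step is replaced by a plain array taken in shuffled row order:

```python
import pandas as pd

# Toy stand-in for bc_data: one feature column plus a label column.
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0],
                   "label": ["a", "b", "c", "d"]})

# Stand-in for the shuffle done by train_test_split: rows are reordered,
# but each row keeps its original index label (2, 0, 3, 1).
shuffled = df.iloc[[2, 0, 3, 1]]

# Stand-in for the PCA output: a bare NumPy array in shuffled row order.
raw = shuffled[["x"]].values

# Wrapping the array in a new DataFrame gives it a fresh index 0..n-1.
out = pd.DataFrame(raw, columns=["PCA 1"])

# Assigning a Series aligns on index labels, not on position, so the
# labels land in the ORIGINAL row order while the data stays shuffled.
out["misaligned"] = shuffled["label"]

# Option 1: drop the index and assign purely by position.
out["by_position"] = shuffled["label"].values

# Option 2: give the new frame the same index as the shuffled data,
# so that label alignment and row order agree.
aligned = pd.DataFrame(raw, columns=["PCA 1"], index=shuffled.index)
aligned["label"] = shuffled["label"]

print(out)
print(aligned)
```

If this is indeed the cause, the equivalent change in the code above would be to pass index=bc_train.index when building bc_pca, or to assign bc_train['diagnosis'].values instead of the Series.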