I am working with a relatively big data set (only using about 1/32 of it, but this subset is approx. 50000x9000). In order to perform analysis on this, I have taken several steps to reduce the dimensionality, so that I can then apply some sort of clustering algorithm.
Take a look at the following data frame:
set.seed(340)
df = data.frame(replicate(10,sample(0:10,size = 10,replace = TRUE)))
> df
   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1   4  9  4  6  9  4  2  5  8   8
2   5  8  2  0  4  6  1  1  0  10
3   1  7  6  3  5  9  6  0  7   1
4   0  6  8  6  6  0  5  5 10  10
5   2  0  5  8  2 10  8  2  1   5
6   3  9 10  2  8  5  2 10  3  10
7   9  0  1  0  6  8  9  6  5   0
8   5  6  9  3 10  4  4  8  6   9
9   8  7  6  2 10  9  9  7  1  10
10  0  7  2  6  1  6  3  2  3   9
Each row represents a person, and each variable records how often that person exhibited a given quality. Say I perform principal component analysis on this using princomp() and collect the first four PCs to use for k-means.
pc = princomp(df)
new_df = cbind(pc$loadings[,1],pc$loadings[,2],pc$loadings[,3],pc$loadings[,4])
fit = kmeans(new_df,2)
From this I can deduce which cluster exhibits high values of which principal components, and I can use the loadings to see what each principal component is a general measure of. However, I would ultimately like to connect this information to my original data set. Is there a way to assign each person in the original data to a cluster created by the k-means on the principal components? Or am I misunderstanding the concept of PCA?
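To make the goal concrete, here is a minimal sketch of what assigning each person to a cluster might look like. It rests on an assumption not in the code above: that the clustering is run on the component scores (pc$scores, one row per person) rather than on the loadings (one row per original variable). The 50-row toy data frame and the names scores and cluster are made up for illustration; more rows than columns are used so that princomp() is not working with a rank-deficient covariance matrix.

```r
set.seed(340)

# Hypothetical toy data: 50 people (rows) by 10 variables (columns)
df <- data.frame(replicate(10, sample(0:10, size = 50, replace = TRUE)))

pc <- princomp(df)

# pc$scores holds each person's coordinates on the principal components,
# so it has one row per row of df; keep the first four components
scores <- pc$scores[, 1:4]

# Cluster the people in the reduced four-dimensional space
fit <- kmeans(scores, centers = 2)

# fit$cluster follows the row order of df, so it can be attached directly
df$cluster <- fit$cluster

# Number of people assigned to each cluster
table(df$cluster)
```

With the labels attached, something like aggregate(. ~ cluster, data = df, FUN = mean) would summarise the original variables per cluster, while the loadings would still describe what each component measures.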