I'm performing cluster analysis in R on data about students' interest in different career options. There are 11 career options, each rated out of 5, and 80 students in total. I performed hierarchical cluster analysis using complete linkage. Surprisingly, the first 40 students all fall in the first cluster and the last 40 students all fall in the second cluster. I reran the code with data for only 50 students and got the same pattern (the first 25 in cluster 1 and the remaining 25 in cluster 2). Can anyone explain why this is happening?
#Cluster Analysis
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering visualization
library(dendextend) # for comparing two dendrograms
# Load data (row.names = 1 is not currently passed to read.csv)
career_datatp <- read.csv("/Users/fatimaperwaizkhan/Documents/NSF Research Work/Results/career_data50.csv", header = TRUE, sep = ",")
career_data <- na.omit(career_datatp)
#career_data <- scale(career_data)
head(career_data)
# Dissimilarity matrix
d <- dist(career_data, method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete")
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1, main = 'Cluster Analysis of Careers')
# Compute with agnes
hc2 <- agnes(d, method = "complete")
pltree(hc2, cex = 0.6, hang = -1, main = "Dendrogram of agnes")
# Agglomerative coefficient
hc2$ac
I wasn't expecting the students to be grouped by serial number.
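One check that may help narrow this down, sketched on simulated data rather than the real file: since `row.names = 1` is not passed to `read.csv`, a serial-number column (if the CSV has one as its first column, which is an assumption here) would be treated by `dist()` as just another numeric variable. The data frame `sim`, its column names, and the random ratings below are hypothetical stand-ins. The sketch clusters the same data with and without the serial column and cross-tabulates the two-cluster cut against row order.

```r
# Hypothetical stand-in for the career data: 50 students, 11 ratings from 1 to 5
set.seed(1)
n <- 50
ratings <- as.data.frame(matrix(sample(1:5, n * 11, replace = TRUE), nrow = n))
names(ratings) <- paste0("career_", 1:11)

# Assumed layout: a serial-number column read in as an ordinary numeric variable
sim <- cbind(serial = seq_len(n), ratings)

# Clustering with the serial column included (mirrors dist(career_data))
hc_with_id <- hclust(dist(sim, method = "euclidean"), method = "complete")
cut_with_id <- cutree(hc_with_id, k = 2)

# Clustering on the 11 rating columns only
hc_ratings <- hclust(dist(sim[, -1], method = "euclidean"), method = "complete")
cut_ratings <- cutree(hc_ratings, k = 2)

# Compare cluster membership against the first and second half of the rows
first_half <- sim$serial <= n / 2
print(table(cluster = cut_with_id, first_half = first_half))
print(table(cluster = cut_ratings, first_half = first_half))

# Spread of each column: the serial column spans far more than any rating
print(sapply(sim, function(x) diff(range(x))))
```

If the first table lines up closely with row order and the second does not, the serial column is probably dominating the Euclidean distance. In that case, passing `row.names = 1` to `read.csv`, or dropping that column before calling `dist()`, would be the thing to try on the real data.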