
I'm performing cluster analysis in R on data about students' interest in different career options. There are 11 career options, each rated from 1 to 5, and 80 students in total. I performed hierarchical cluster analysis using complete linkage. Surprisingly, the first 40 students end up in the first cluster and the last 40 students in the second cluster. I reran the code with data for only 50 students and got the same pattern (the first 25 in cluster 1 and the remaining 25 in cluster 2). Can anyone please explain why this is happening?

#Cluster Analysis
library(tidyverse)  # data manipulation
library(cluster)    # clustering algorithms
library(factoextra) # clustering visualization
library(dendextend) # for comparing two dendrograms
#load data, , row.names=1
career_datatp<-read.csv("/Users/fatimaperwaizkhan/Documents/NSF Research Work/Results/career_data50.csv", header = TRUE, sep=",")
career_data <- na.omit(career_datatp)
#career_data <- scale(career_data)
head(career_data)

# Dissimilarity matrix
d <- dist(career_data, method = "euclidean")

# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )

# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1, main = 'Cluster Analysis of Careers')

# Compute with agnes
hc2 <- agnes(d, method = "complete")
pltree(hc2, cex = 0.6, hang = -1, main = "Dendrogram of agnes") 
# Agglomerative coefficient
hc2$ac

Here are my data set and dendrogram.

I wasn't expecting the students to be grouped on the basis of their serial number.

  • What are the columns in "career_data"? Your code uses all of the columns in the data frame, including some that are not relevant to the clustering. Also, you probably haven't normalized the data, so a difference of 40 in serial number will swamp any of the small differences in the remaining columns. In the future, do not post images of your code; please cut and paste it into the question. Also, please provide a sample of your data so we are not guessing at the process. – Dave2e Aug 25 '23 at 21:11
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please [do not post code or data in images](https://meta.stackoverflow.com/q/285551/2372064) – MrFlick Aug 25 '23 at 21:14
  • @Dave2e why do I need to standardize if all the careers are rated from 1 to 5? Also, I have included the data set and code. – Fatima Perwaiz Khan Aug 25 '23 at 22:20
  • OK, once you remove the first column "students" you don't have to normalize in this case, but this is a rare case. Most datasets have columns on very different scales, like income and years of education. – Dave2e Aug 25 '23 at 23:47
  • @Dave2e the students column holds the IDs that I'm using to identify students. How can I tell R not to use the students column for cluster analysis? – Fatima Perwaiz Khan Aug 26 '23 at 17:33
  • By default the dendrogram leaves are labeled with the row index. There are two workarounds. One is to run `hclust(dist(career_data[, -1]))` without the student ID column and then pass `labels = career_data$students` to `plot()` so the leaves are labeled properly. The other is to use the student ID as `row.names()` and remove that column from the data frame. – Dave2e Aug 26 '23 at 19:55
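
The two workarounds from the last comment can be sketched as follows. This is only an illustration on simulated data, not the asker's file: the column name `students`, the `career_1` to `career_11` column names, and the random ratings are all assumptions made for the example. It first clusters with the ID column included, so its effect can be inspected, and then clusters on the ratings alone.

```r
# Simulated stand-in for the real data (assumption: a numeric ID column
# named "students" followed by 11 career ratings on a 1-5 scale).
set.seed(42)
n <- 80
ratings <- matrix(sample(1:5, n * 11, replace = TRUE), nrow = n)
colnames(ratings) <- paste0("career_", 1:11)
career_data <- data.frame(students = 1:n, ratings)

# Problem case: the ID column is part of the distance computation.
# IDs differ by up to n - 1, ratings by at most 4, so the ID dominates.
hc_with_id <- hclust(dist(career_data), method = "complete")
groups_with_id <- cutree(hc_with_id, k = 2)
# Cross-tabulate cluster membership against "first half of the IDs".
print(table(cluster = groups_with_id,
            first_half = career_data$students <= n / 2))

# Workaround 1: drop the ID column before computing distances and
# pass the IDs to plot() as leaf labels.
hc_no_id <- hclust(dist(career_data[, -1]), method = "complete")
groups_no_id <- cutree(hc_no_id, k = 2)
plot(hc_no_id, labels = career_data$students, cex = 0.6, hang = -1,
     main = "Clustering on ratings only")

# Workaround 2: move the IDs into the row names, then drop the column.
# The row names travel with the distance matrix and become the labels.
career_named <- career_data[, -1]
rownames(career_named) <- as.character(career_data$students)
hc_named <- hclust(dist(career_named), method = "complete")
```

Both workarounds cluster on exactly the same ratings, so they should produce the same tree; they differ only in where the leaf labels come from. With `read.csv()`, passing `row.names = 1` should have the same effect as the second workaround, provided the ID really is the first column of the file.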
