I have a data set of responses to a survey administered in three different locations. The number of responses from each location, however, does not reflect the size of the underlying population in that area. I want to adjust (weight) the data before running a cluster analysis.
Before weighting, I ran an ordinary hierarchical cluster analysis with 3 clusters, using Ward's method. The data set is arranged as follows:
Province Gender Age Marital_Status Occupation Q1 Q2 Q3 ...
1
2
3
...
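# Read the survey responses
mydata <- read.csv(file=".csv", header=TRUE, sep=",")
# Drop incomplete responses, then standardize every column
mydata <- na.omit(mydata)
mydata <- scale(mydata)
# Ward clustering on Euclidean distances
d <- dist(mydata, method = "euclidean")
fit <- hclust(d, method="ward.D2")
# Draw the dendrogram and outline the 3-cluster cut
plot(fit, labels=FALSE)
groups <- cutree(fit, k=3)
rect.hclust(fit, k=3, border="green")
# Cluster sizes
table(groups)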
The current results are three groups:
1 2 3
243 114 143
I can see which of the three clusters each entry (response) in the data set belongs to. However, the "province" and "age" variables were not randomly sampled; respondents were selected in proportions that are not representative of the underlying population. I want to weight the data to correct for this and see whether the clusters differ in size and quality after weighting.
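To make the kind of adjustment I have in mind concrete, here is a minimal sketch of post-stratification weights by province, applied to the cluster sizes only. Everything in it is an assumption for illustration: the toy `province` and `groups` vectors stand in for my real columns, and the `pop_share` values are made-up population shares, not real figures. It reweights the cluster sizes after the fact; it does not let the weights influence the clustering itself, which is the part I am unsure how to do.

```r
# Toy stand-in for the survey (made-up values, not my real data):
# 'province' codes and the cluster labels that cutree() would return.
province <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
groups   <- c(1, 1, 2, 3, 1, 2, 2, 3, 3, 1)

# Hypothetical population shares per province (assumed; e.g. from a census).
pop_share <- c("1" = 0.2, "2" = 0.3, "3" = 0.5)

# Sample share per province, then weight = population share / sample share.
samp_share <- prop.table(table(province))
w_by_prov  <- pop_share[names(samp_share)] / as.numeric(samp_share)

# Attach each respondent's weight by looking up their province code.
w <- as.numeric(w_by_prov[as.character(province)])

# The weights sum to the sample size, so weighted sizes are comparable
# to the unweighted counts from table(groups).
unweighted_sizes <- table(groups)
weighted_sizes   <- tapply(w, groups, sum)

print(unweighted_sizes)
print(round(weighted_sizes, 2))
```

Adjusting for age as well would presumably mean defining the weights on province-by-age-band cells instead of on province alone.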
Thank you.