
I have a data set of responses to a survey administered in three different locations. However, the number of responses from each location is not proportional to the underlying population of that area. I want to adjust (weight) the data before running a cluster analysis.

Before weighting, I am running a standard hierarchical cluster analysis with 3 clusters, using Ward's method. The data set is arranged as follows:

  Province  Gender  Age  Marital_Status Occupation Q1  Q2  Q3 ...
1 
2
3
...

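# Read the survey responses
mydata <- read.csv(file=".csv", header=TRUE, sep=",")
mydata <- na.omit(mydata)                 # drop incomplete responses
mydata <- scale(mydata)                   # standardize; requires all columns to be numeric
d <- dist(mydata, method = "euclidean")   # pairwise Euclidean distances
fit <- hclust(d, method="ward.D2")        # Ward's criterion; ward.D2 squares the distances internally
plot(fit, labels=FALSE)                   # dendrogram
groups <- cutree(fit, k=3)                # cut the tree into 3 clusters
rect.hclust(fit, k=3, border="green")     # outline the 3 clusters on the dendrogram
table(groups)                             # cluster sizes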

The current results are three groups:

  1   2   3 
243 114 143 

I can see which of the three clusters each entry (response) in the data set belongs to. However, the variables "province" and "age" were not randomly sampled; respondents were chosen in proportions that do not represent the underlying population. I want to weight the data to adjust for this and see whether the clusters differ in size and quality after weighting.
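To make the goal concrete, here is a minimal sketch of one possible approach, not a definitive method. It uses toy data, and the population shares in `pop_share` are made up for illustration; the real figures would come from census or similar data. It computes a post-stratification weight per respondent (population share divided by sample share of that respondent's province), compares unweighted and weighted cluster sizes, and then re-clusters a weighted resample as a rough stand-in for a weighted Ward clustering. If joint population shares were available, the same ratio could presumably be computed per province-by-age cell.

```r
# Illustrative sketch only: toy data and hypothetical population shares.
set.seed(42)

# Toy survey in which province A is over-sampled.
n <- 300
survey <- data.frame(
  Province = sample(c("A", "B", "C"), n, replace = TRUE, prob = c(0.6, 0.3, 0.1)),
  Q1 = rnorm(n),
  Q2 = rnorm(n),
  Q3 = rnorm(n)
)

# Hypothetical population shares per province (assumption; replace with real figures).
pop_share <- c(A = 0.2, B = 0.3, C = 0.5)

# Post-stratification weight = population share / sample share of the respondent's province.
prov <- as.character(survey$Province)
sample_share <- table(prov) / length(prov)
survey$w <- as.numeric(pop_share[prov]) / as.numeric(sample_share[prov])

# Unweighted clustering, as in the question.
x <- scale(survey[, c("Q1", "Q2", "Q3")])
fit <- hclust(dist(x, method = "euclidean"), method = "ward.D2")
survey$cluster <- cutree(fit, k = 3)

# Cluster sizes: raw counts vs. sums of weights.
table(survey$cluster)
weighted_sizes <- tapply(survey$w, survey$cluster, sum)
round(weighted_sizes, 1)

# Re-cluster a weighted resample: rows are drawn with probability proportional
# to their weight, so under-sampled provinces appear more often.
idx <- sample(seq_len(n), size = n, replace = TRUE, prob = survey$w)
fit_w <- hclust(dist(x[idx, , drop = FALSE], method = "euclidean"), method = "ward.D2")
groups_w <- cutree(fit_w, k = 3)
table(groups_w)
```

Because the resample is random, the weighted clustering changes from draw to draw; repeating it over several seeds would give a sense of how stable the weighted clusters are.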

Thank you.

kath
  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. That includes a sample of data, all necessary code, and a clear explanation of what you're trying to do and what hasn't worked. You might also have some luck looking at Cross Validated (stats.stackexchange.com) – camille Aug 17 '19 at 17:38
  • Have you tried implementing it yourself? With Ward you may also want to take the variance *within* each location into account. – Has QUIT--Anony-Mousse Aug 18 '19 at 09:03

0 Answers