I am having trouble understanding and replicating the scale function in R. I know that it performs z-standardization when all arguments are left at their defaults, but I cannot reproduce exactly the same scaled values for a particular cluster after clustering has been performed. Here is an example:

Let's define a dataset:

set.seed(16)
nc = 10
nr = 10000
df1 = data.frame(matrix(sample(1:50, size = nr * nc, replace = TRUE), ncol = nc, nrow = nr))
head(df1, n=4)

Before clustering I need to scale the data:

for_clst_km = scale(df1) #standardization with z-scores
Clusters <- kmeans(for_clst_km, 6, iter.max = 100000, nstart = 5)

After clustering is performed, I can obtain scaled values for cluster 3:

ver1=for_clst_km[Clusters$cluster==3,]

I now want to replicate ver1 using data from the original dataset df1:

cluster3 = df1[Clusters$cluster==3,]
cluster3$cluster = NULL

for_clst_means = apply(df1,2,mean)
for_clst_sd = apply(df1,2,sd)

ver2 = (sweep(cluster3, 2, for_clst_means))/for_clst_sd

ver3 = apply(cluster3, 2, function(x) ((x-for_clst_means)/for_clst_sd))

Finally, when I compare the three versions, I see that they are different:

all(ver1 == ver2)
[1] FALSE

all(ver1 == ver3)
[1] FALSE

Why is that? And how can I make ver2 or ver3 exactly the same as ver1? Thanks!

Makaroni
  • I'm not reading your code super closely, but it possibly looks like a numerical precision issue. Maybe use `all.equal`, which allows for a bit of tolerance in the comparison. Also see https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal. If this is the reason, then this is a dupe. – lmo Sep 23 '17 at 21:26
  • Please purge "clustering" from the question to make it more understandable. I don't think it relates to clustering anymore; can you use `df1` instead of `cluster3`? Also check whether ver2 equals ver3, and print the sum of differences. And you may want to **look at your data**, too. Don't fly blind. Large difference or tiny? Systematic or random? NAs? – Has QUIT--Anony-Mousse Sep 23 '17 at 21:36
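
Following the suggestions in the comments, here is a minimal sketch of how the comparison could be set up without the clustering step. It is an illustration under stated assumptions, not a confirmed diagnosis: `keep` is a hypothetical row subset standing in for `Clusters$cluster==3`, and `sub`, `col_means`, and `col_sds` are names introduced only for this example. Two things appear to matter here. First, dividing a data frame by a vector whose length equals the number of columns recycles that vector down the rows rather than across the columns, and inside `apply(..., 2, ...)` each `x` is a whole column, so the full vector of means gets recycled along it. Applying `sweep` for both the centering and the scaling step keeps each statistic with its own column. Second, `==` on floating-point results is stricter than `all.equal`, which allows a small tolerance.

```r
set.seed(16)
nc <- 10
nr <- 10000
df1 <- data.frame(matrix(sample(1:50, size = nr * nc, replace = TRUE),
                         ncol = nc, nrow = nr))

# Reference: scale the full data set, then take a subset of rows.
scaled <- scale(df1)

# ASSUMPTION: hypothetical row subset standing in for Clusters$cluster == 3.
keep <- rep(c(TRUE, FALSE, FALSE), length.out = nr)
ver1 <- scaled[keep, ]

# Replicate from the raw data, using statistics of the FULL data set.
sub       <- as.matrix(df1[keep, ])
col_means <- colMeans(df1)
col_sds   <- apply(df1, 2, sd)

# sweep() pairs each statistic with its own column (MARGIN = 2).
ver2 <- sweep(sweep(sub, 2, col_means, "-"), 2, col_sds, "/")

# Alternative: let scale() do the work with explicit center and scale.
ver3 <- scale(sub, center = col_means, scale = col_sds)

# Compare with a tolerance and ignore dimnames / "scaled:*" attributes.
all.equal(ver1, ver2, check.attributes = FALSE)
all.equal(ver1, ver3, check.attributes = FALSE)

# Look at the size of the differences instead of testing exact equality.
max(abs(ver1 - ver2))
```

If the remaining differences are on the order of machine precision, `all.equal` is the appropriate check. The centering and scaling values that `scale` actually used are also stored on its result as the attributes `"scaled:center"` and `"scaled:scale"`, which can be passed to `sweep` in place of the recomputed means and standard deviations.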
