
I have a large dataset with more than 300,000 rows (observations) and 22 variables. I used the CLARA method for clustering and plotted the results with fviz_cluster. The silhouette method suggested 10 clusters, so I used that as k in my CLARA call.

library(cluster)  # provides clara()

clara.res <- clara(df, 10, samples = 50, trace = 1, sampsize = 1000, pamLike = TRUE)

str(clara.res)
List of 10
 $ sample    : chr [1:1000] "100046" "100303" "10052" "100727" ...
 $ medoids   : num [1:10, 1:22] 0.925 0.125 0.701 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:10] "193751" "137853" "229261" "257462" ...
  .. ..$ : chr [1:22] "COD" "DMW" "HER" "SPR" ...
 $ i.med     : int [1:10] 104171 42062 143627 174961 300065 13836 192832 207079 185241 228575
 $ clustering: Named int [1:302251] 1 1 1 2 3 4 5 3 3 3 ...
  ..- attr(*, "names")= chr [1:302251] "1" "10" "100" "1000" ...
 $ objective : num 0.37
 $ clusinfo  : num [1:10, 1:4] 71811 40181 46271 10155 31309 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:4] "size" "max_diss" "av_diss" "isolation"
 $ diss      : 'dissimilarity' num [1:499500] 1.392 2.192 0.937 2.157 1.643 ...
  ..- attr(*, "Size")= int 1000
  ..- attr(*, "Metric")= chr "euclidean"
  ..- attr(*, "Labels")= chr [1:1000] "100046" "100303" "10052" "100727" ...
 $ call      : language clara(x = df, k = 10, samples = 50, sampsize = 1000, trace = 1, pamLike = TRUE)
 $ silinfo   :List of 3
  ..$ widths         : num [1:1000, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:1000] "83395" "181310" "34452" "42991" ...
  .. .. ..$ : chr [1:3] "cluster" "neighbor" "sil_width"
  ..$ clus.avg.widths: num [1:10] 0.645 0.408 0.487 0.513 0.839 ...
  ..$ avg.width      : num 0.612
 $ data      : num [1:302251, 1:22] 1 1 1 0.366 0.35 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:302251] "1" "10" "100" "1000" ...
  .. ..$ : chr [1:22] "COD" "DMW" "HER" "SPR" ...
 - attr(*, "class")= chr [1:2] "clara" "partition"

For the plot:

library(factoextra)  # provides fviz_cluster()
library(ggplot2)     # provides theme_classic(), xlim(), ylim(), labs()

fviz_cluster(clara.res,
             palette = c("#004c6d", "#00a1c1", "#ffc334", "#78ab63", "#00ffff",
                         "#00cfe3", "#6efa75", "#cc0089", "#ff9509", "#ffb6de"), # color palette
             ellipse.type = "t",
             geom = "point",
             show.clust.cent = TRUE,
             repel = TRUE,
             pointsize = 0.5,
             ggtheme = theme_classic()
) + xlim(-7, 3) + ylim(-5, 4) + labs(title = "Plot of clusters")

The result: [image: cluster plot produced by fviz_cluster]

I gather that this cluster plot is based on PCA, and I have been trying to figure out which variables in my original data make up Dim1 and Dim2, i.e. what the x- and y-axes represent. How can I find out what Dim1 and Dim2 are, along with the eigenvalues/variance of all the dimensions, without running PCA separately? I saw that there are other functions and packages for PCA, such as get_eigenvalue in factoextra and FactoMineR, but it seems they would require me to run the PCA from scratch. How can I integrate this directly with my CLARA results?
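For illustration, here is a minimal sketch of how the two plotted dimensions could be recovered from a CLARA result with base R's prcomp. It assumes that fviz_cluster standardizes the data before its PCA (its stand = TRUE default) and that the clara object kept its data matrix in $data, as the str() output above shows. The toy data, its size, and the name toy.res are hypothetical stand-ins for df and clara.res.

```r
library(cluster)  # provides clara()

set.seed(1)
# Hypothetical stand-in for the real data: 500 rows, 5 numeric variables
toy <- matrix(rnorm(500 * 5), ncol = 5,
              dimnames = list(NULL, paste0("V", 1:5)))

toy.res <- clara(toy, k = 3, samples = 10, sampsize = 100, pamLike = TRUE)

# Assumption: fviz_cluster() scales the data first (stand = TRUE),
# so we scale here and tell prcomp() not to center or scale again
pca <- prcomp(scale(toy.res$data), center = FALSE, scale. = FALSE)

# Loadings: the weight of each original variable in Dim1 and Dim2
loadings <- pca$rotation[, 1:2]
print(round(loadings, 3))

# Scores: the coordinates of each observation on Dim1 and Dim2
scores <- pca$x[, 1:2]
print(head(scores))
```

Each dimension is a weighted combination of all the variables rather than a single chosen variable, so the loadings are what tell you which variables dominate an axis. With the real object, the same idea would be applied to clara.res$data.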

Also, Dim1 accounts for only 12.3% of the variance and Dim2 for 8.8%. Does that mean these two dimensions are not representative enough? Considering that I have 22 dimensions in total (from my 22 variables), I think that's all right, no? I am not sure how the Dim1 and Dim2 percentages affect my cluster results. I was thinking of making a scree plot from my CLARA results, but I can't figure that out either.
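As a rough sketch of where such percentages and a scree plot come from, the eigenvalues of a PCA are the squared standard deviations returned by prcomp, and each dimension's percentage is its eigenvalue divided by the total. The data below is simulated and purely hypothetical; for the real case the input would presumably be the data stored in the CLARA object.

```r
set.seed(2)
# Hypothetical data: 400 rows, 6 variables, two of them correlated
x <- matrix(rnorm(400 * 6), ncol = 6)
x[, 2] <- x[, 1] + 0.5 * x[, 2]

pca <- prcomp(x, center = TRUE, scale. = TRUE)

eig <- pca$sdev^2            # eigenvalues of the correlation matrix
pct <- 100 * eig / sum(eig)  # percentage of variance per dimension

eig_table <- data.frame(
  dim            = seq_along(eig),
  eigenvalue     = eig,
  variance_pct   = pct,
  cumulative_pct = cumsum(pct)
)
print(eig_table)

# Base R scree plot of the eigenvalues
screeplot(pca, type = "lines", main = "Scree plot")
```

The variance_pct column corresponds to the percentages shown on the axis labels of the cluster plot, and cumulative_pct shows how many dimensions are needed to reach a given share of the total variance.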

I'd appreciate any insights.

  • It does not appear that `fviz_cluster` returns the results of the principal components, so you will have to use `prcomp` on your data to construct your own. The size of the components suggests that the correlations between the variables are modest (but the expected size of each PC would be about 4.5%, i.e. 100/22, if they were all completely uncorrelated). That does not necessarily affect the cluster results, only your ability to visualize them. – dcarlson Jan 17 '22 at 18:40
  • Indeed. I tried prcomp again and it worked. I thought that running the PCA separately would give me different results, but apparently not. So thank you! @dcarlson – samudra_biru Jan 17 '22 at 19:53
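To make the uncorrelated baseline from the comment concrete, here is a small simulation sketch. It assumes 22 independent standard normal variables, which is hypothetical data and not the asker's. With p standardized variables the average share per component is exactly 100/p, roughly 4.5% for p = 22, so the observed 12.3% and 8.8% can be read against that figure.

```r
set.seed(3)
p <- 22
# Hypothetical data: 5000 rows of 22 mutually independent variables
x <- matrix(rnorm(5000 * p), ncol = p)

pca <- prcomp(x, center = TRUE, scale. = TRUE)
pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)

baseline <- 100 / p  # average share per component

cat("Baseline share per component:", round(baseline, 2), "%\n")
cat("Largest simulated share:     ", round(pct[1], 2), "%\n")
cat("Smallest simulated share:    ", round(pct[p], 2), "%\n")
```

In a finite sample the first component lands a little above the baseline and the last a little below it, purely from sampling noise, so only shares well above 100/p point to real correlation structure.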
