
I have a data frame, df, containing the x and y coordinates of a bunch of points. Here's an excerpt:

> tail(df)
            x        y
1495 0.627174 0.120215
1496 0.616036 0.123623
1497 0.620269 0.122713
1498 0.630231 0.110670
1499 0.611844 0.111593
1500 0.412236 0.933250

I am trying to find out the most appropriate number of clusters. Ultimately the goal is to do this with tens of thousands of these data frames, so the method of choice must be quick and can't be visual. Based on those requirements, it seems like the RWeka package is the way to go.

I managed to load the RWeka package (I had to install the Java SE Runtime on my computer first), install Weka's XMeans package, and run it:

library("RWeka") # requires Java SE Runtime
WPM("refresh-cache") # Build Weka package metadata cache
WPM("install-package", "XMeans") # Install XMeans package if not previously installed

weka_ctrl <- Weka_control( # Create a Weka control object to specify our parameters
  I = 100, # max no iterations overall
  M = 100, # max no iterations in the kmeans loop
  L = 2,   # min no clusters
  H = 5,   # max no clusters
  D = "weka.core.EuclideanDistance", # distance metric
  C = 0.4, S = 1)
x_means <- XMeans(df, control = weka_ctrl) # run algorithm on data

This produces exactly the result I want:

XMeans
======
Requested iterations            : 100
Iterations performed            : 1
Splits prepared                 : 2
Splits performed                : 0
Cutoff factor                   : 0.4
Percentage of splits accepted 
by cutoff factor                : 0 %
------
Cutoff factor                   : 0.4
------

Cluster centers                 : 2 centers

Cluster 0
            0.4197712002617799 0.9346986806282739
Cluster 1
            0.616697959239131 0.11564350951086963

Distortion: 30.580934
BIC-Value : 2670.359509

I can assign each point in my data-frame to a cluster by running x_means$class_ids.

However, I would like to have a way of retrieving the coordinates of the cluster centres. I can see them in the output and write them down manually, but if I am to run tens of thousands of these, I need to be able to have a piece of code that saves them into a variable. I can't seem to subset x_means by using square brackets, so I don't know what else to do.

Thank you so much in advance for your help!

1 Answer


The centers do not seem to be directly stored in the structure that is returned. However, since the structure does tell you which cluster each point belongs to, it is easy to compute the centers. Since you do not provide your data, I will illustrate with the built-in iris data.

As you observed, printing out the result shows the centers. We can use this to check the result.

x_means <- XMeans(iris[,1:4], control = weka_ctrl) 
x_means
## Output truncated to just the interesting part.
Cluster centers                 : 2 centers

Cluster 0
            6.261999999999998 2.872000000000001 4.906000000000001 1.6760000000000006
Cluster 1
            5.005999999999999 3.428000000000001 1.4620000000000002 0.2459999999999999

So here's how to compute the centers from the cluster assignments:

colMeans(iris[x_means$class_ids==0,1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       6.262        2.872        4.906        1.676 
colMeans(iris[x_means$class_ids==1,1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.006        3.428        1.462        0.246 

The results agree.
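
Since you need this for tens of thousands of data frames, the two colMeans calls above can be folded into one step that handles however many clusters XMeans settles on. The following is only a sketch: cluster_centers is a hypothetical helper name, not part of RWeka, and the label vector ids is a stand-in built from iris$Species so that the example runs without Java. With a fitted model you would pass x_means$class_ids instead.

```r
# Hypothetical helper (not part of RWeka): one row of column means per cluster.
cluster_centers <- function(data, class_ids) {
  # split() groups the rows of the data frame by cluster label;
  # colMeans() gives the center of each group;
  # rbind stacks the centers into a matrix, one row per cluster.
  do.call(rbind, lapply(split(data, class_ids), colMeans))
}

# Stand-in labels (an assumption, so the sketch runs without RWeka):
# setosa is labelled 1 and everything else 0.
ids <- ifelse(iris$Species == "setosa", 1L, 0L)

centers <- cluster_centers(iris[, 1:4], ids)
centers
```

The row names of centers are the cluster labels, so centers["0", ] gives the center of cluster 0. For your own data the call would be cluster_centers(df, x_means$class_ids), and the resulting matrix can be stored in a list as you loop over your data frames.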

G5W