
I have two sets of 2-D points, `feat_left` and `feat_right`, obtained from a siamese network, and I plotted the `feat_left` points in x,y coordinates as shown below.

[scatter plot of the 2-D `feat_left` features, colored by digit label 0-9]

Here is the Python script:

import json
import matplotlib.pyplot as plt
import numpy as np

with open('predictions-mnist.txt') as fp:
    data = json.load(fp)

n = len(data['outputs'])
label_list = np.zeros(n, dtype=int)   # ground-truth digit label per sample
feat_left = np.zeros((n, 2))          # 2-D feature from the left branch per sample

for i, (key, val) in enumerate(data['outputs'].items()):
    feat_left[i] = val['feat_left']
    # the digit label is encoded in the 7th path component of the key
    label_list[i] = int(key.split("/")[6])


f = plt.figure(figsize=(16,9))

c = ['#ff0000', '#ffff00', '#00ff00', '#00ffff', '#0000ff',
     '#ff00ff', '#990000', '#999900', '#009900', '#009999']

for i in range(10):
    plt.plot(feat_left[label_list==i,0].flatten(), feat_left[label_list==i,1].flatten(), '.', c=c[i])
plt.legend(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
plt.grid()
plt.show()

Now I want to calculate the centroid and then the purity of each cluster.

    how do you define "accuracy" of a cluster? – Shai Jun 01 '17 at 08:55
  • You can use k-means (k=10), or have a look at the different [clustering methods](http://scikit-learn.org/stable/modules/clustering.html) provided by the module `sklearn.cluster` – Nuageux Jun 01 '17 at 08:55
  • I am following this article [Evaluation of clustering](https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html) @Shai – cpwah Jun 01 '17 at 09:01
  • You could also suggest a way to evaluate the clusters @Shai – cpwah Jun 01 '17 at 09:37

1 Answer


The centroid is simply the mean:

centroids = np.zeros((10, 2), dtype='f4')
for i in range(10):
    centroids[i, :] = np.mean(feat_left[label_list == i, :2], axis=0)
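Equivalently, the per-label means can be stacked in one pass (a minimal sketch, assuming every label 0-9 occurs at least once in `label_list`; an empty slice would yield `nan`):

# stack the mean of each label's points into a (10, 2) array
centroids = np.array([feat_left[label_list == i, :2].mean(axis=0) for i in range(10)])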

As for accuracy, you can compute the mean squared distance of each cluster's points from its centroid:

sqerr = np.zeros((10,), dtype='f4')
for i in range(10):
    diff = feat_left[label_list == i, :2] - centroids[i, :]
    sqerr[i] = np.mean(np.sum(diff**2, axis=1))  # mean squared distance to the centroid

Computing purity:

def compute_cluster_purity(gt_labels, pred_labels):
  """
  Compute purity of predicted labels (pred_labels), given 
  the ground-truth labels (gt_labels).

  Assuming gt_labels and pred_labels are both lists of int of length n
  """
  n = len(gt_labels) # number of elements
  assert len(pred_labels) == n
  purity = 0
  for l in set(pred_labels):
    # for predicted label l, what are the gt_labels of this cluster?
    gt = [gt_labels[i] for i, il in enumerate(pred_labels) if il==l]
    # most frequent gt label in this cluster:
    mfgt = max(set(gt), key=gt.count)
    purity += gt.count(mfgt) # count intersection between most frequent ground truth and this cluster
  return float(purity)/n

See this answer for more details on selecting the most frequent label in a cluster.
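For example, to score a k-means clustering of the features against the ground-truth digit labels (a minimal sketch; `feat_left` and `label_list` are the arrays built in the question, and the k-means call mirrors the snippet in the comments below):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10).fit(feat_left)  # predicted cluster id per sample
purity = compute_cluster_purity(list(label_list), list(kmeans.labels_))
print('cluster purity: %.3f' % purity)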

  • I computed the centroid using kmeans clustering and I am more interested in calculating the purity metric. @Shai – cpwah Jun 01 '17 at 14:14
  • @cpwah this is a **different** question: for purity, you should have the ground-truth labeling and the assignment of `kmeans` (like `label_list` in your example) – Shai Jun 01 '17 at 14:33
  • yeah I have modified my question. The ground truth labeling is present in `label_list` as indicated by you. @Shai – cpwah Jun 01 '17 at 14:39
  • @cpwah but what is the labeling of `kmeans`? – Shai Jun 01 '17 at 14:43
  • Here it is: `kmeans = KMeans(n_clusters=10); kmeans.fit(feat_left); centroids = kmeans.cluster_centers_; labels = kmeans.labels_` – cpwah Jun 01 '17 at 14:44
  • Is that okay? @Shai – cpwah Jun 01 '17 at 15:57
  • I calculated it manually following the [link](https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html) and your function gives the same value. I obtained the confusion matrix and then divided by the total number of samples. @Shai – cpwah Jun 01 '17 at 16:36
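As a follow-up to the last comment, the same purity value can be read off a confusion matrix (a minimal sketch, assuming scikit-learn is available; `label_list` and `kmeans.labels_` as above):

from sklearn.metrics import confusion_matrix

# rows = ground-truth digits, columns = predicted clusters
cm = confusion_matrix(label_list, kmeans.labels_)
# take the dominant ground-truth count in each cluster column, then normalize
purity = cm.max(axis=0).sum() / float(cm.sum())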