I have a clustering algorithm that returns an array containing the label of every data point. I want to find the indices of every data point that belongs to the largest cluster. Doing this naively involves quite a few steps: first I have to find all the unique labels in the array, then get the most frequent label, and finally create a mask that selects only the elements whose label equals the most frequent label and find the indices from that. See the example code below:

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
labels = np.array([1, 2, -1, 4, -1, -1, 4, 3, -1, -1])
uq_labels, uq_counts = np.unique(labels, return_counts=True)
most_frequent_label = uq_labels[uq_counts.argmax()]
largest_cluster_indices = np.asarray(labels == most_frequent_label).nonzero()[0]

This is quite inefficient. It involves creating quite a few intermediate arrays and traversing the data. Is there any other more efficient way to do the same thing?

Mad Physicist
  • Does this answer your question? [Find the most frequent number in a NumPy array](https://stackoverflow.com/questions/6252280/find-the-most-frequent-number-in-a-numpy-array) – swag2198 Jun 28 '21 at 15:42
  • @swag2198 No that is a completely different question. –  Jun 28 '21 at 15:57
  • "This is quite inefficient"-> Based on what metric? Have you timed/profiled it and found a problem? – Mad Physicist Jun 28 '21 at 17:02
  • @MadPhysicist Based on the logical metric that you could do it in a single for loop with intermediates. –  Jun 28 '21 at 17:27

2 Answers

You could try the mode function from scipy.stats to get the most frequent value and then np.where to locate its positions. Something like this:

from scipy.stats import mode
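# get the most frequent value and locate its indices using np.where
# (np.ravel handles both older SciPy, where .mode is a length-1 array,
# and newer SciPy, where it is a scalar)
np.where(labels == np.ravel(mode(labels).mode)[0])[0]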

Or if you don't want to use a new library, you could use Counter from collections instead.

from collections import Counter

counter = Counter(labels)
np.where(labels == counter.most_common(1)[0][0])[0]
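
For reference, here is a self-contained sketch of the Counter approach using the labels array from the question. The helper name largest_cluster_indices is only for illustration, and converting with tolist() before counting is my own choice so that the counter keys are plain Python ints:

```python
from collections import Counter

import numpy as np


def largest_cluster_indices(labels):
    """Return the indices of the elements that carry the most common label."""
    # tolist() turns NumPy scalars into plain Python ints before counting
    counts = Counter(labels.tolist())
    most_frequent_label, _ = counts.most_common(1)[0]
    return np.flatnonzero(labels == most_frequent_label)


labels = np.array([1, 2, -1, 4, -1, -1, 4, 3, -1, -1])
indices = largest_cluster_indices(labels)
print(indices.tolist())  # [2, 4, 5, 8, 9]
```

Note that this still counts the labels in a Python-level loop, so it is unlikely to beat np.unique on large arrays; it mainly avoids the SciPy dependency.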
heretolearn

You can simplify some of the calculation by using return_inverse=True, which also returns, for each element of labels, the index of its value in the array of unique labels:

uq_labels, uq_inv, uq_counts = np.unique(labels, return_inverse=True, return_counts=True)
largest_cluster_indices = np.flatnonzero(uq_inv == uq_counts.argmax())

If you don't actually need the indices because a mask will do, don't bother calling np.flatnonzero.
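
As a rough sketch of the mask-only variant, assuming the data and labels arrays from the question (the name largest_cluster_data is just for illustration):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
labels = np.array([1, 2, -1, 4, -1, -1, 4, 3, -1, -1])

# uq_inv[i] is the position of labels[i] within uq_labels
uq_labels, uq_inv, uq_counts = np.unique(
    labels, return_inverse=True, return_counts=True
)

# Boolean mask of the points in the largest cluster; no index array is built
mask = uq_inv == uq_counts.argmax()
largest_cluster_data = data[mask]
print(largest_cluster_data.tolist())  # [3, 5, 6, 9, 10]
```

The mask can be used directly for boolean indexing, which skips the extra pass that np.flatnonzero would make.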

Mad Physicist