I have a clustering algorithm that returns an array containing the label of every data point. I want to find the indices of all data points that belong to the largest cluster. Doing this naively takes several steps: first find all the unique labels in the array, then pick the most frequent label, and finally build a mask that selects only the elements whose label equals the most frequent one and take the indices from that mask. See the example code below:
import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
labels = np.array([1, 2, -1, 4, -1, -1, 4, 3, -1, -1])
# Count how often each distinct label occurs
uq_labels, uq_counts = np.unique(labels, return_counts=True)
# Label with the highest count (-1 in this example)
most_frequent_label = uq_labels[uq_counts.argmax()]
# Indices of the points carrying that label
largest_cluster_indices = np.flatnonzero(labels == most_frequent_label)
This is quite inefficient. It creates several intermediate arrays and makes multiple passes over the data. Is there a more efficient way to do the same thing?
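For comparison, here is a minimal sketch of one possible variant. It assumes the labels are integers spanning a fairly small range, and it uses `np.bincount` to count labels in place of `np.unique`. Because `np.bincount` only accepts non-negative integers, the labels are shifted by their minimum first. Whether this is actually faster depends on the data and has not been measured here.

```python
import numpy as np

labels = np.array([1, 2, -1, 4, -1, -1, 4, 3, -1, -1])

# Assumption: labels are integers. Shift them so the smallest label maps to 0,
# since np.bincount rejects negative values.
offset = labels.min()
counts = np.bincount(labels - offset)

# Undo the shift to recover the original label value.
most_frequent_label = counts.argmax() + offset

# Indices of all points that carry the most frequent label.
largest_cluster_indices = np.flatnonzero(labels == most_frequent_label)
```

This still builds a counts array and a boolean mask, so it reduces the intermediate work rather than removing it. If the label range is very wide, the counts array grows with that range.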