I have this simple k-means function that I apply to a list of lists of floats:

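import numpy as np
from sklearn.cluster import KMeans

def clustering(k, lists_to_cluster):
    # Cluster the sublists on their maximum value only (1-D k-means)
    max_vals = [max(sublist) for sublist in lists_to_cluster]
    kmeans_ampl = KMeans(k, random_state=123).fit(np.array(max_vals).reshape(-1, 1))
    # labels_ holds one cluster label per sublist (these are labels, not centroids)
    return kmeans_ampl.labels_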

centroids_labels = clustering(3,lists_to_cluster)

centroids_labels is [0,0,1,2,2,0], but the lists with the highest max_vals are labeled 0. I'd like the cluster labels to be sorted in ascending order of max_vals: label 0 assigned to the lists with the lowest max_vals, and so on up to label k-1 for the highest max_vals. Is there a way to do this before or while applying k-means, or should I just sort and remap the labels afterwards? Thanks!

Movilla
  • Hi! Can you please [edit] to include the relevant `import` at the top of your code snippet? I also suggest adding a line `lists_to_cluster = [........]`. This way your code snippet will be a [mre]. – Stef May 04 '23 at 15:01
  • Note that your code only uses `lists_to_cluster` to compute the list `max_vals`. Then you cluster these values. The sublists are mostly ignored, except for extracting their maximum values. – Stef May 04 '23 at 15:05
  • Also note that `max_vals` is a list of simple numeric values, so what you're doing is clustering in 1d. KMeans is great starting at 2d, but for 1d it's really inappropriate, both overkill and inefficient. See for instance https://stackoverflow.com/a/20241986/3080723 – Stef May 04 '23 at 15:22
  • The awkward expression `np.array(max_vals).reshape(-1,1)` is a big hint that kmeans doesn't expect 1d data. – Stef May 04 '23 at 15:23

1 Answer

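You can group the values of `max_vals` by cluster into a dictionary that maps each cluster label to the list of values in that cluster.

Then sort the dictionary values (the lists) by their minimum, their maximum, or any other statistic such as the mean. The position of each list in the sorted order is its new label.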

def relabel(labels, vals):
    d = {}
    for k, v in zip(labels, vals):
        d.setdefault(k, []).append(v)
    return list(enumerate(sorted(d.values(), key=min))) # or key=max, or key=statistics.mean

lists_to_cluster = [[1], [2], [3], [6], [7], [8], [101], [102], [103]]
max_vals = [max(sublist) for sublist in lists_to_cluster]
centroids_labels = clustering(3,lists_to_cluster)
print( relabel(centroids_labels, max_vals) )
# [(0, [1, 2, 3]), (1, [6, 7, 8]), (2, [101, 102, 103])]
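
If you need the relabeled label array itself rather than the grouped values, a variation on the same idea could look like the sketch below. The function name `sorted_labels` and the sample `max_vals` are made up for illustration; the labels mimic the `[0,0,1,2,2,0]` from the question. It uses only plain Python, so it should work on the `labels_` array returned by `clustering` as well as on a list.

```python
def sorted_labels(labels, vals, key=min):
    """Remap labels so that label 0 is the cluster with the smallest key(values)."""
    groups = {}
    for label, v in zip(labels, vals):
        groups.setdefault(label, []).append(v)
    # Old labels, ordered by the chosen statistic of their values
    order = sorted(groups, key=lambda label: key(groups[label]))
    # Map each old label to its rank in that order
    mapping = {old: new for new, old in enumerate(order)}
    return [mapping[label] for label in labels]

# Hypothetical data: the highest values carry label 0, as in the question
old_labels = [0, 0, 1, 2, 2, 0]
max_vals = [9.5, 9.1, 1.2, 5.0, 5.3, 8.8]
print(sorted_labels(old_labels, max_vals))
# [2, 2, 0, 1, 1, 2]
```

As with `relabel`, you can pass `key=max` or `key=statistics.mean` to rank the clusters by a different statistic.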
Stef