Python clustering list of sets

Question

For a collection of sets, I'll need to group them into multiple "cluster" that all the node in the cluster is a subset of the largest node in that cluster. E.g.

input = [{1, 2}, {1, 2, 3}, {1, 2, 4}, {1, 4}, {1}]

will get:

[{1, 2, 3}, {1, 2}, {1}], 
[{1, 2, 4}, {1, 2}, {1, 4}, {1}]

I've tried building a subset tree with reference of this, but it soon become very slow when input is large since it's iterating all children for every insertion.

I'm not familiar with k-mean clustering, but does it apply to the problem here?

What is the most efficient way of doing it?

what is the output for `[{1, 2}, {1, 2, 3}, {1, 2, 4}, {1}]` ? — Epsi95, Feb 21 '21 at 16:19
What is called "[clustering](https://en.wikipedia.org/wiki/Cluster_analysis)" in the literature is something different. K-Means and other clustering algorithms won't help you here. — Thomas, Feb 21 '21 at 16:42

Tomerikoo · Answer 1 · 2021-02-21T16:38:29.637

1

First sort the list in descending order of lengths. This way you start with the longest sets which surely are not subsets of any other.

Then, save each representative set as the key (after converted to a tuple) of a dict with list values.

For each set, check if it's a subset of any key and add it to the respective list.

Only if it wasn't added to any key, it means it's a new representative.

In the end, take the values() of the result dict:

l = [{1, 2}, {1, 2, 3}, {1, 2, 4}, {1, 4}, {1}]

grouped_sets = {}
for cur_set in sorted(l, key=len, reverse=True):
    is_subset = False
    for represent, sets in grouped_sets.items():
        if cur_set.issubset(represent):
            sets.append(cur_set)
            is_subset = True

    if not is_subset:
        grouped_sets[tuple(cur_set)] = [cur_set]

print(list(grouped_sets.values()))

Which gives:

[[{1, 2, 3}, {1, 2}, {1}], 
 [{1, 2, 4}, {1, 2}, {1, 4}, {1}]]

edited Feb 21 '21 at 16:38

answered Feb 21 '21 at 16:33

Tomerikoo

18,379
16
47
61

This looks OK with small enough inputs, but with large inputs and when length of 'res' is large, insertion is going to be really slow right? Since we are doing issubset checking for the entire list for each insertion. I think it might improve if we use some sort of tree to do subset check to avoid redundant check? – Guo Xian HO Feb 21 '21 at 16:45
It only checks `issubset` for each representative. Not for the whole list. For the above example there are only two representatives – Tomerikoo Feb 21 '21 at 16:47
But representative can also grow into very huge given large enough inputs, and we might do redundant subset check right? E.g. Inserting {2,3} into [{9,4,8}, {9, 8}], [{10,4,5}], [{2,3,4}], and first 2 issubset check is kinda redundant since there is no common element. – Guo Xian HO Feb 21 '21 at 16:58
1

I'm pretty sure this solution is optimal, it iterates each set once and then places it in the relevant clusters, there isn't any redundant checking here (as far as I can tell) And with reference to your latest comment @GuoXianHO, the isset check is only for the _reference_ or _header_ set for each cluster, which is essentially the seed for that group. And that is a required check to see if the set in question belongs in that cluster. I was thinking a similar tree sol'n before Tom posted his, but reading it I think it's the best you can do – Joseph Holland Feb 21 '21 at 16:58
@GuoXianHO Of course some checks will be *"redundant"* but how can you avoid that? You need to check to what "cluster" the current set fits. Note that by sorting the list, as the sets get smaller there are more chances that most checks are ***not*** redundant. And while the sets are still big, there are not yet many representatives to check – Tomerikoo Feb 21 '21 at 17:11

score 1 · Answer 2 · answered Feb 21 '21 at 17:36

Perhaps sorting the sets by decreasing order of length will reduce the number of intersections to be made down to one per cluster per set. This will be data dependent and won't improve if there are no subsets but it should improve as the clusters are bigger:

setList = [{1, 2}, {1, 2, 3}, {1, 2, 4}, {1, 4}, {1}]

groups = []
for aSet in sorted(setList,key=len,reverse=True):
    clusters = [g for g in groups if g[0].issuperset(aSet)]
    if not clusters:
        groups.append([])
        clusters = groups[-1:]
    for g in clusters:
        g.append(aSet)

print(groups)

[[{1, 2, 3}, {1, 2}, {1}], [{1, 2, 4}, {1, 2}, {1, 4}, {1}]]

Python clustering list of sets

2 Answers2