How to merge sets in better then O(len(set))?

Question

I have a list of sets called groups. groups[i] is a set of labels (integers) that are part of the same group as i. For example,

# idx:     0    1    2        3    4      5        6        7
groups = [{0}, {1}, {2,5,6}, {3}, {4,7}, {2,5,6}, {2,5,6}, {4,7}]

indicates that

0 belongs to the group containing just 0
1 belongs to the group containing just 1
2, 5, and 6 each belong to the same group, namely the group containing 2, 5 and 6
3 belongs to the group containing just 1
4 and 7 each belong to the same group, namely the group containing 4 and 7

Sometimes it turns out that two groups are the same. For example, say we found out that 4 and 5 belong to the same group. Then we have to merge group {2,5,6} with group {4,7}. After this merge, groups should look like:

# idx:     0    1    2            3    4            5            6            7
groups = [{0}, {1}, {2,4,5,6,7}, {3}, {2,4,5,6,7}, {2,4,5,6,7}, {2,4,5,6,7}, {2,4,5,6,7}]

Naively, this would require updating groups[2], groups[4], groups[5], groups[6] and groups[7] or, in general, when merging groupA with groupB, it would require len(groupA) + len(groupB) updates.

A more efficient approach, would be if instead of having multiple copies of the same set in groups, we had a reference to the same set

# idx:     0    1   2        3   4       5       6       7
groups = [{0}, {1}, groupA, {3}, groupB, groupA, groupA, groupB]

where groupA == {2,5,6} and groupB == {4,7}

Then, to merge groupA and groupB would require at most min(len(groupA), len(groupB)) updates:

if len(groupA) < len(groupB):
  groupA, groupB = groupB, groupA
# len(groupA) >= len(groupB)
groupA |= groupB
for label in groupB:
  groups[label] = groupA

which would result in

# idx:     0    1   2        3   4       5       6       7
groups = [{0}, {1}, groupA, {3}, groupA, groupA, groupA, groupA]

where groupA == {2,4,5,6,7}

Is there a more efficient way to merge groups? Is Union-Find (Disjoin Set Union) data structure the way?

Wait, you *know* about union-find data structures already? Just use those! They're designed for the job, and *way* more efficient than anything you can do with standard `set`s. — user2357112, Nov 16 '22 at 01:25

score 0 · Answer 1 · answered Nov 06 '22 at 18:16

I think it's about the best you can do.

Sets are, to my knowledge, implemented via a hash table in the Python standard library.

Depending on the implementation, yes, I think the set unions should be performed in O(min(a,b)) time, where a and b are the length of two sets. You just write each member of the smaller set into the larger set. If the member already exists, that's fine -- if both sets share a member, you'll just be overwriting an existing member in the set.

You can make your code perform a bit better by wrapping it up as follows:

def smash(groups, *indices):
    idx0 = indices[0]
    for idx in indices[1:]:
        groups[idx0] |= groups[idx]
        groups[idx] = groups[idx0]

Python >=3.9 helps you out here: to my knowledge, if the left and right side of |= are identical (same place in memory), it just skips the operation.

(A note for others: the |= operator works in Python >=3.9. Before that, you'll need to use the instance .update method.)

So:

each index (element in indices) maps to a group in groups
for each index you provide, go and try and merge everything into the first group
after merging in, update the reference

Because this does the identity check at the same time, this is performed on average in O(N + M*N) time, where:

N is the number of unique groups you want to merge (N <= len(indices))
M is the expectation for the average time taken to do a union operation, assuming that the order of the indices you provide is random, and not sorted to optimize the union operations:

with a_i as the size of set i.

This `smash` function doesn't work. [It doesn't update all the indices that need updating.](https://ideone.com/KiHR0N) (Also, the set `|=` operator has existed since the `set` type was added to Python.) — user2357112, Nov 16 '22 at 01:32

How to merge sets in better then O(len(set))?

1 Answers1