2

I have a list of sets called groups. groups[i] is a set of labels (integers) that are part of the same group as i. For example,

# idx:     0    1    2        3    4      5        6        7
groups = [{0}, {1}, {2,5,6}, {3}, {4,7}, {2,5,6}, {2,5,6}, {4,7}]

indicates that

  • 0 belongs to the group containing just 0
  • 1 belongs to the group containing just 1
  • 2, 5, and 6 each belong to the same group, namely the group containing 2, 5 and 6
  • 3 belongs to the group containing just 1
  • 4 and 7 each belong to the same group, namely the group containing 4 and 7

Sometimes it turns out that two groups are the same. For example, say we found out that 4 and 5 belong to the same group. Then we have to merge group {2,5,6} with group {4,7}. After this merge, groups should look like:

# idx:     0    1    2            3    4            5            6            7
groups = [{0}, {1}, {2,4,5,6,7}, {3}, {2,4,5,6,7}, {2,4,5,6,7}, {2,4,5,6,7}, {2,4,5,6,7}]

Naively, this would require updating groups[2], groups[4], groups[5], groups[6] and groups[7] or, in general, when merging groupA with groupB, it would require len(groupA) + len(groupB) updates.


A more efficient approach, would be if instead of having multiple copies of the same set in groups, we had a reference to the same set

# idx:     0    1   2        3   4       5       6       7
groups = [{0}, {1}, groupA, {3}, groupB, groupA, groupA, groupB]

where groupA == {2,5,6} and groupB == {4,7}

Then, to merge groupA and groupB would require at most min(len(groupA), len(groupB)) updates:

if len(groupA) < len(groupB):
  groupA, groupB = groupB, groupA
# len(groupA) >= len(groupB)
groupA |= groupB
for label in groupB:
  groups[label] = groupA

which would result in

# idx:     0    1   2        3   4       5       6       7
groups = [{0}, {1}, groupA, {3}, groupA, groupA, groupA, groupA]

where groupA == {2,4,5,6,7}


Is there a more efficient way to merge groups? Is Union-Find (Disjoin Set Union) data structure the way?

joseville
  • 685
  • 3
  • 16
  • 2
    Wait, you *know* about union-find data structures already? Just use those! They're designed for the job, and *way* more efficient than anything you can do with standard `set`s. – user2357112 Nov 16 '22 at 01:25

1 Answers1

0

I think it's about the best you can do.

Sets are, to my knowledge, implemented via a hash table in the Python standard library.

Depending on the implementation, yes, I think the set unions should be performed in O(min(a,b)) time, where a and b are the length of two sets. You just write each member of the smaller set into the larger set. If the member already exists, that's fine -- if both sets share a member, you'll just be overwriting an existing member in the set.

You can make your code perform a bit better by wrapping it up as follows:

def smash(groups, *indices):
    idx0 = indices[0]
    for idx in indices[1:]:
        groups[idx0] |= groups[idx]
        groups[idx] = groups[idx0]

Python >=3.9 helps you out here: to my knowledge, if the left and right side of |= are identical (same place in memory), it just skips the operation.

(A note for others: the |= operator works in Python >=3.9. Before that, you'll need to use the instance .update method.)

So:

  • each index (element in indices) maps to a group in groups
  • for each index you provide, go and try and merge everything into the first group
  • after merging in, update the reference

Because this does the identity check at the same time, this is performed on average in O(N + M*N) time, where:

  • N is the number of unique groups you want to merge (N <= len(indices))

  • M is the expectation for the average time taken to do a union operation, assuming that the order of the indices you provide is random, and not sorted to optimize the union operations:

    average union time

    with a_i as the size of set i.

soyapencil
  • 509
  • 4
  • 9
  • This `smash` function doesn't work. [It doesn't update all the indices that need updating.](https://ideone.com/KiHR0N) (Also, the set `|=` operator has existed since the `set` type was added to Python.) – user2357112 Nov 16 '22 at 01:32