How to store data elements in groups/clusters by union-find

Question

I have implemented union-find algorithm inspired by this code with this API:

void unify(int p, int q)
int find(int p)
boolean connected(int p, int q)
int clusterSize(int p)

Then I detect clusters with a nested loop like this:

for i = (0) to (data size)
    for j = (i+1) to (data size)
        if ( "i" and "j" meet this condition )
            unify(i,j)
    end for
end for

To actually store each cluster/group in a separate array, currently I loop over all my data elements and call find like this:

for i = 0 to data-size
    k = find(i);
    // ... Store data element "i" in cluster "k" (array "k")
end-for

I have a feeling like, the last for loop to store data elements in different groups/clusters might not be necessary. I wonder if I'm on right track?

Are you running the for loop to initialize? If so, then calling `find(i)` is not necessary. The idea is - initially all the elements are separate cluster of their own. Then upon some condition you will call `unify(p,q)` which will unify those two elements and make one cluster. — Arnab Roy, Feb 13 '19 at 08:49

score 1 · Answer 1 · answered Feb 13 '19 at 10:21

First you need to initialize by making each element a separate cluster. While doing so there's no need to call find method as initially each of them itself is a cluster.

int clusters[data_size];
for(int i = 0; i < data_size; i++)
{
  clusters[i] = i; // each element refers to itself initially, indicating that it's a separate cluster
}

Now upon the nested loop check which elements are needed to be clustered, and then unify those -

for i = (0) to (data size)
    for j = (i+1) to (data size)
        if ( "i" and "j" meet this condition )
            unify(i,j)
    end for
end for

In the unify(p, q) method, what you can do is -

clusters[find(p)] = find(q); // indicates p belongs to cluster number q

Or the vice versa is also true -

clusters[find(q)] = find(p); // indicates q belongs to cluster number p

There's a ranking technique in the Union Find algorithm, to choose which of the above to use. That ensures - we will always merge the smaller cluster into the bigger cluster. But I assume you are not worried about that right now.

Now, the next question would be - why are we calling find in the above expression. It's because, we are actually finding out the root element of a certain cluster(Or simply find the element which actually represent the cluster).

Thanks! So far so good. Now I wonder what is the best practice to extract my clusters in separate arrays like this: `cluster_1 = [element_1, element_5, element_7]`, `cluster_2 = [element_2, element_3, element_6]`, `cluster_3 = [element_4, element_8]` ? Do I have to use loops? — Megidd, Feb 13 '19 at 10:28
There are a number of ways to do that. Looping over would be one way. Though I don't think it will be very efficient. [this answer](https://stackoverflow.com/a/23055353/9041122) has a very useful insight. Also [this](https://www.hackerearth.com/practice/notes/disjoint-set-union-union-find/) has a nice illustration. Refer to the alternate implementation of the second link. — Arnab Roy, Feb 13 '19 at 10:44

How to store data elements in groups/clusters by union-find

1 Answers1