1

I'm trying to optimize a union-find algorithm for finding connected components in images. My image can be a 2d or 3d file, consisting of 0s and 1s. I found an implementation in this thread: Connected Component Labelling, with the answer by user Dukering.

I adapted that code to my purpose. The code works, but the execution time rapidly becomes too big. I don't understand the problem.

My code is shown below. The file I was testing it with is linked here: https://utexas.box.com/s/k12m17rg24fw1yh1p21hytxwq5q8959u That is a 2223x2223 size file (defined in the program below).

As the original user mentioned, this is a basic implementation of union-find, and one can make it more efficient. I don't understand how. In addition, I tested this image in Matlab, and Matlab is much faster. For example, the image linked above takes ~1.5 minutes on my computer, but Matlab does it in like a second using bwlabel. I checked the algorithms bwlabel uses, and it seemed to be some variation of union-find, which is why I started this effort in the first place. How do I make my code work as fast as that? I should also mention that I hope to run my code on much larger images (as large as 1000^3). There is no way my current version can do that.

    #include <time.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define w 2223
    #define h 2223

    void writeArrayInt(int *data, int dims[], char *filename)
    {
     FILE *fp;

     fp = fopen(filename,"w"); 

     /* write grid dimensions */
     fwrite(dims, sizeof(int), 3, fp); 

      /* write data array */
      fwrite(data, sizeof(int), w*h, fp);

      fclose(fp);
      }

      void readArrayInt(int *data, int dims[], char *filename)
      {
       FILE *fp;

       fp = fopen(filename,"r"); 

       /* read grid dimensions */
       fread(dims, sizeof(int), 3, fp); 

       /* read data array */
       fread(data, sizeof(int), w*h, fp);

       fclose(fp);
       }

       void doUnion(int a, int b, int *component)
       {
        // get the root component of a and b, and set the one's parent to the other
       while (component[a] != a)
         a = component[a];
       while (component[b] != b)
         b = component[b];
       component[b] = a;
       }

       void unionCoords(int x, int y, int x2, int y2, int *component, int *input)
       {
        int ind1 = x*h + y;
        int ind2 = x2*h + y2;
        if (y2 < h && x2 < w && input[ind1] && input[ind2] && y2 >= 0 && x2 >= 0)
    doUnion(ind1, ind2, component);
        }

       int main()
       {
       int i, j;
       int *input = (int *)malloc((w*h)*sizeof(int));
       int *output = (int *)malloc((w*h)*sizeof(int));
       int dims[3];

       char fname[256];
       sprintf(fname, "phi_w_bin");
       readArrayInt(input, dims, fname); 

       int *component = (int *)malloc((w*h)*sizeof(int));

       for (i = 0; i < w*h; i++)
         component[i] = i;

 for (int x = 0; x < w; x++)
    for (int y = 0; y < h; y++)
    {
        unionCoords(x, y, x+1, y, component, input);
        unionCoords(x, y, x, y+1, component, input);
        unionCoords(x, y, x-1, y, component, input);
        unionCoords(x, y, x, y-1, component, input);
        unionCoords(x, y, x+1, y+1, component, input);
        unionCoords(x, y, x-1, y+1, component, input);
        unionCoords(x, y, x+1, y-1, component, input);
        unionCoords(x, y, x-1, y-1, component, input);
    }

for (int x = 0; x < w; x++)
{
    for (int y = 0; y < h; y++)
    {
        int c = x*h + y;
        if (input[c] == 0)
        {
            output[c] = input[c];
            continue;
        }
        while (component[c] != c) c = component[c];

        int c1 = x*h + y;
        output[c1] = component[c];
    }
}

sprintf(fname, "outputImage2d");
writeArrayInt(output, dims, fname);  

free(input);
free(output);
free(component);  
}
rahulv88
  • 101
  • 9
  • If your code is working and you want suggestion on how to improve its performance then [codereview.se] is a more appropriate place for this question. – kaylum Jul 03 '17 at 05:03
  • Please indent your code. If you indent with tab key, now is the time to go find how to tell your IDE to insert spaces... – Lundin Jul 03 '17 at 06:13
  • Your indentation is really needs help: and fixing it would help you get help. – Richard Jul 03 '17 at 07:08

3 Answers3

2

I would recomment two improvements to your union-find structure:

  • Actually implement both union and find! If you have a working find method, implementing union becomes much simpler because you don't need the while (component[c] != c) kind of lines. For reference, check out the informative Wikipedia entry on union-find data structures
  • Implement some of the common speedup heuristics like path-compression (store the value that find(x) returns in component[x], thus reducing the time needed in a second call of find(x)) and union-by-rank or union-by-size (make the larger set the parent of the smaller set)

EDIT: Since there seemed to be some clarification needed concerning another answer, I'll add a minimal implementation myself:

typedef struct {
    int* parent;
    int size;
} union_find;

union_find make_sets(int size) {
    union_find result;
    result.parent = malloc(sizeof(int) * size);
    result.size = size;
    for (int i = 0; i < size; ++i) {
        result.parent[i] = size;
    }

    return result;
}

int find(union_find uf, int i) {
    if (uf.parent[i] < uf.size)
        return uf.parent[i] = find(uf, uf.parent[i]);
    return i;
}

void do_union(union_find uf, int i, int j) {
    int pi = find(uf, i);
    int pj = find(uf, j);
    if (pi == pj) {
        return;
    }
    if (pi < pj) {
        // link the smaller group to the larger one
        uf.parent[pi] = pj;
    } else if (pi > pj) {
        // link the smaller group to the larger one
        uf.parent[pj] = pi;
    } else {
        // equal rank: link arbitrarily and increase rank
        uf.parent[pj] = pi;
        ++uf.parent[pi];  
    }
}
Tobias Ribizel
  • 5,331
  • 1
  • 18
  • 33
  • I'm a bit confused. Is there just one parent array which stores parents for everyone? So when I initialize, I just have one union_find structure variable, uf? and then all my operations are performed on uf? If that is the case, then do I initially just do `make_sets(N)` where N is the total size of the array? – rahulv88 Jul 03 '17 at 18:28
  • Yes, the parent array has nearly the same functionality as the components array in your case. The only difference is the way the array is initialized (parent[i] = size instead of parent[i] = i) and what these entries mean if parent[i] >= size. `make_sets(w*h)` semantically does the same thing as your initialization (`for (i = 0; i < w*h; i++) component[i] = i;`) – Tobias Ribizel Jul 03 '17 at 19:55
  • It looks like this works very well, thanks a lot Tobias! I have one additional question: if I do the same thing for 3d, where I would have 26-connectivity (instead of 8-connectivity in my example code), would things slow down any? From my understanding of the time complexity, it should be more or less the same as the 2d case right? – rahulv88 Jul 03 '17 at 22:16
  • The time complexity of union and find should be nearly independent of the number of elements, yes. But since you will probably have lots more pixels to iterate over *and* more neighbors for every pixel, the overall runtime will probably be a bit larger – Tobias Ribizel Jul 03 '17 at 23:19
1

Union-find should work in constant time if correctly implemented.

Here are a few ideas:

-- modify find such that each time when you go up the tree until you reach the root (the root being the node with the property NODE.up = NODE ), you update all the UP nodes for all the nodes that you followed going up. In other words, when you look for the connected component of 2 nodes, you update that component (represented as the index of its root node) for all the nodes that were followed on that path.

-- the second time when you find the component of a node it will be constant time not only for itself but also for the intermediate nodes.

-- union should all the time take likear time array[node] = parent_node.

alinsoar
  • 15,386
  • 4
  • 57
  • 74
-1

One of the good working algorithm for disjoint-sets using union by rank and path compression is follows:

Implementation, using struct Node component[]. Which contains, the array of all the elements.

#include <stdio.h>
#include <stdlib.h>

struct Node
{
    // Needed for union and find.
    int parent;
    int rank;
};

// Find implementation using path compression, NOTE: a is index of the element to be found.
int find (struct Node *component, int a)
{
    if (component[a].parent != a)
        return component[a].parent = find(component[a], component[a].parent)
    return a;
}

// Union implementation using rank. NOTE: a and b are index of the element
void union(struct Node *component, int a, int b)
{
    if (find(component, a) != find(component, b))
    {

        if (component[a].rank == component[b].rank)
            component[a].rank += 1;

        if (component[a].rank >= component[b].rank)
            component[b].parent = a;
        else
            component[a].parent = b;
    }    
}

You can use the above functions, to do Union-Find in constant time (amortised). And it should be clear that you might have to modify the structure, as it fits your data.

You can also implement it in C++ by using templates. But as the question is tagged with C, hence I have provided this solution.

If you wanna read about the above algorithm, this link might help.Union-Find Algorithm

Please comment for any further clarification.

NVS Abhilash
  • 574
  • 7
  • 24
  • 1
    Union-find can be quite efficiently implemented without using pointer indirection, using only one large malloc instead of one for every node. Using pointers IMHO makes it much more confusing than just using a flat array storing the parent indices for every element. – Tobias Ribizel Jul 03 '17 at 06:48
  • @Tobias I have added both implementation, for the sake of completeness. Please review the recent changes. – NVS Abhilash Jul 03 '17 at 07:09
  • I think I need to clarify what I meant: The implementation itself is more readable this way, but still jumping between pointers and indices is IMO not that helpful if you mix them up. The return statement of find(...) seems to return an incorrect value (should be a parent pointer or something similar, but is simply the input argument a, which doesn't get changed in the method at all) The implementation of OP already seems to be based around indices, so why jump to pointers which need to store the index (val) explicitly? – Tobias Ribizel Jul 03 '17 at 07:19
  • @Tobias Thanks! Now I get it. I think your second opinion is right. I don't think your views on find are correct. Please look carefully, it returns in the `if` condition as well. I request you to please review my recent edits and comment. Thanks! – NVS Abhilash Jul 03 '17 at 07:25
  • @Tobias The component array, must have both the parent index as well as the rank, so that it can use the benifits of both the heuristics. – NVS Abhilash Jul 03 '17 at 07:26
  • Did you check if the code compiles? It seems to me like the recursive call to find is missing an argument. Concerning your second question, storing the value would be unnecessary if we just used a flat array. Also, there is a neat trick to avoid storing the ranks by using out-of-bounds indices as rank indicators, but this is a bit too advanced for this simple question and also provides not that large a speedup. – Tobias Ribizel Jul 03 '17 at 07:31
  • It seems we're talking past each other a bit, I've added a sample implementation (copied from my code, so it's using the out-of-bounds trick for rank storage) to clarify what I meant. – Tobias Ribizel Jul 03 '17 at 08:03
  • Hi, thanks a lot for both your comments. So @NVS Abhilash, how do I initialize my data? I mean, like I say in my question, my file consists of a large array of 0s and 1s. So, in your implementation, initially, would I just have n nodes, where n is the number of 1s? and then I go about doing union on each vertex neighbor? Also, once I have the union done, I don't need to do find again, right? – rahulv88 Jul 03 '17 at 15:03
  • @rahulv88 Yes that's exactly what you should do IMO. If you do find, then it will return the representative of that set, hence you can differentiate using the find method all the different components. Different representative implies, different components. – NVS Abhilash Jul 03 '17 at 17:01
  • Um ok...so parent is a pointer right? Can I do this: `component[a].parent != a`? I mean one is a pointer, and the other is an int? Sorry I'm not a very advanced coder! – rahulv88 Jul 03 '17 at 17:26
  • @rahulv88 sorry, I have changed the code. The parent is it, the index. – NVS Abhilash Jul 03 '17 at 19:51
  • @NVSAbhilash thanks a lot for your efforts man....I'll accept Tobias' code for now as it's working already. – rahulv88 Jul 03 '17 at 22:17