
Problem

I am trying to do connected component labeling for arrays with more than 3 dimensions. By that I mean that my boolean array has a .shape like, e.g., (5, 2, 3, 6, 10), which would be 5 dimensions.

For 2D images (instead of my >3D problem), connected component labeling assigns labels to connected areas (hyper-volumes in my case). Two (hyper-)pixels are connected if they are next to each other and both are True in the boolean array.
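To make the adjacency rule concrete, here is a tiny hand-made 2D illustration (my own example, not taken from the question):

```python
import numpy as np

# two separate blobs of True pixels in a 2D boolean array
img = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
], dtype=bool)

# under 1-connectivity (4-connectivity in 2D), the three True pixels in the
# top-left corner form one component and the three True pixels in the
# bottom-right corner form another, because each pair within a blob shares an edge
```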


What I already tried

For 2 dimensions this can be done with OpenCV, and with up to 3 dimensions it can be done with scikit-image's skimage.measure.label. However, I am not sure how to do it for my case.


Further material for the interested reader (but it does not help my question):

Make42
  • Did you try a naive iterative approach? Iterate over every data point and set an unused label on it; check every connected neighbour of this data point; if there are already labelled neighbours, set their label on the current data point, otherwise set the label of the current data point on the connected neighbour. Would this work? – Yannick Funk Mar 20 '21 at 17:08
  • @YannickFunk: That was my first implementation. I did it using recursion and without a union-find data structure as it is done in the so called [Classical Algorithm](https://iq.opengenus.org/connected-component-labeling/). This first implementation of mine is insanely slow. It did not help that I implemented this using Python directly (instead of pushing the implementation down to C level). – Make42 Mar 20 '21 at 17:27
  • Why is it so slow? How big are the arrays? – Paul Brodersen Mar 22 '21 at 10:25
  • @PaulBrodersen: Each dimension y is between 15 and 50 and we have between 2 and 64 dimensions, so there are y^x elements in an array. As you can easily see, x is what is hurtful. (I know: when it goes to numbers like 64, the memory is not even enough to hold my arrays, so I am currently rethinking the entire approach. But for now let's say that x is small, like 4.) I think it is slow because I programmed it in Python with loops and so on - but I am not sure. – Make42 Mar 22 '21 at 10:45
  • I am not using the classical algorithm, as I noticed after further inspection. I am starting at each cell that is True ("foreground") and see if it is given a label. For tracking labels I use a separate label-array which is initialized with 0 (= "background" / no label). If it has no label yet I label it and recursively "go" into each neighboring cell that has no label yet and label that with the same label. Such labels are integers >0. Cells that are already labeled or that are False ("background") are not further inspected. – Make42 Mar 22 '21 at 10:51
  • The algorithm you describe **is** one of the classical algorithms, but it's not the most efficient one. I implemented a very efficient labeling algorithm for arbitrary number of dimensions in [DIPlib](https://diplib.org): https://diplib.org/diplib-docs/regions.html#dip-Label-dip-Image-CL-dip-Image-L-dip-uint--dip-uint--dip-uint--dip-StringArray- – Cris Luengo Mar 22 '21 at 15:49
  • @CrisLuengo: Is it exposed to the connector for Python? https://diplib.org/PyDIP.html I did not see a full-fledged documentation... :-/ – Make42 Mar 22 '21 at 16:08
  • Yes, it is. In Python you’d call `dip.Label()` (after `import diplib as dip`). You can use a NumPy array as input directly, the `dip.Image` objects are compatible with them. – Cris Luengo Mar 22 '21 at 16:40
  • @CrisLuengo: Just to clarify: 1) It is faster than the scikit-image implementation, right? 2) What is the runtime complexity regarding number of dimensions and dimensionality per dimension? – Make42 Mar 22 '21 at 17:04
  • I haven’t compared to the skimage implementation, but I know it is very efficient. It is a union-find algorithm, which means it is very close to O(n), for n pixels, independent of image dimensionality ([the actual complexity](https://en.wikipedia.org/wiki/Disjoint-set_data_structure#Time_complexity) involves the inverse Ackermann function, which grows extremely slowly with n). – Cris Luengo Mar 23 '21 at 02:22
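The union-find (disjoint-set) structure behind that near-O(n) bound can be sketched in a few lines (a generic textbook version with path compression and union by size; this is an illustration, not DIPlib's actual implementation):

```python
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.size = [1] * n

    def find(self, x):
        # path compression: after finding the root, point x and its
        # ancestors directly at it, flattening the tree
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # union by size: attach the smaller tree under the larger one
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
```

Path compression plus union by size is what gives the inverse-Ackermann amortized cost per operation mentioned in the comment above.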

2 Answers


scipy.ndimage.label does what you want directly:

In [1]: import numpy as np
In [2]: arr = np.random.random((5,2,3,6,10)) > 0.5
In [3]: from scipy import ndimage as ndi
In [4]: labeled, n = ndi.label(arr)
In [5]: n
Out[5]: 11
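By default `ndi.label` uses what scikit-image calls 1-connectivity (only axis-aligned neighbours). If you want full connectivity (what would be 8-connectivity in 2D), you can pass a structuring element explicitly; a sketch:

```python
import numpy as np
from scipy import ndimage as ndi

arr = np.random.random((5, 2, 3, 6, 10)) > 0.5

# generate_binary_structure(rank, connectivity): setting connectivity equal
# to the rank gives full adjacency (diagonal neighbours included)
structure = ndi.generate_binary_structure(arr.ndim, arr.ndim)
labeled, n = ndi.label(arr, structure=structure)
```

Full connectivity can only merge components, so `n` here is at most the count from the default call.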
Juan

If what would be 4-connectivity in 2D is enough, you can get the neighbouring pixels that are also foreground in n log n time using a nearest-neighbour tree. Then it is a matter of constructing the graph and finding the connected components (also n log n, IIRC).

#!/usr/bin/env python
"""
https://stackoverflow.com/questions/66724201/connected-component-labling-for-arrays-quasi-images-with-many-dimension
"""
import numpy as np
import networkx as nx

from scipy.spatial import cKDTree


def get_components(boolean_array):
    # find neighbours
    coordinates = list(zip(*np.where(boolean_array)))
    tree = cKDTree(coordinates)
    # p=1 -> Manhattan distance; r=1 -> what would be 4-connectivity in 2D
    neighbours_by_pixel = tree.query_ball_tree(tree, r=1, p=1)

    # create graph and find components
    G = nx.Graph()
    for ii, neighbours in enumerate(neighbours_by_pixel):
        G.add_node(ii)  # ensure isolated foreground pixels are labelled, too
        G.add_edges_from((ii, jj) for jj in neighbours if jj != ii)  # skip self-matches
    components = nx.connected_components(G)

    # create output image
    output = np.zeros_like(boolean_array, dtype=int)
    for ii, component in enumerate(components):
        for idx in component:
            output[coordinates[idx]] = ii + 1

    return output


if __name__ == '__main__':

    shape = (5, 2, 3, 6, 10)
    D = len(shape)
    data = np.random.rand(*shape) < 0.1
    output = get_components(data)

For an array with shape (50, 50, 50, 50), I get the following timings on my laptop:

In [48]: %timeit output = get_components(data)
5.85 s ± 279 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Paul Brodersen
  • What do you mean by "If what would be a 4-connectivity in 2D is enough"? Did you mean to say that you use "4-connectivity in 4D, which would be 2-connectivity in 2D"? I follow the terminology of https://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.label. (I actually usually use 1-connectivity though, but appreciate both :-).) – Make42 Mar 22 '21 at 11:24
  • I know what they call 1-connectivity as 4-connectivity, and what they call 2-connectivity as 8-connectivity. But then my image processing lectures have been a while ago so I might misremember. So yeah, this implements 1-connectivity (according to scikit-image terminology). – Paul Brodersen Mar 22 '21 at 11:39
  • It just occurred to me that you can get 8-connectivity (2-connectivity) by looking for points with a Euclidean distance strictly less than 1.5, i.e. `tree.query_ball_tree(tree, r=1.5, p=2)`. – Paul Brodersen Mar 22 '21 at 11:50
  • Also, I updated the answer. There was a bug such that the first component was not labeled. – Paul Brodersen Mar 22 '21 at 11:52
  • I read through your code and think I understand it ;-). A little review (tell me if you do not agree): 1) It probably works well for sparse arrays (= lots of background) but badly for dense arrays. For sparse arrays, it probably scales even nicely (regarding constants) for many dimensions. Unfortunately, as I noticed yesterday very late at night, my arrays are very dense (which I did not expect at first). However, I have an idea how I could make them sparse, but I still have to think about that idea. – Make42 Mar 22 '21 at 11:53
  • 2) Your approach becomes particularly useful if one does not really need the array `output`, but only the components - e.g. the size of the components or the coordinates of the components' elements. That would be in favor in my case. 3) Your speed is not horrible with 5secs, but not super either. I need to repeat the algorithm many times (100x in experimental settings, 10^(3+) for real-life data.) I think the KDTree of scikit-learn might actually be faster though: https://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python/ Maybe there are more ways for speed-up. – Make42 Mar 22 '21 at 11:58
  • I don't think sparsity matters much. The code is n log n where n is the number of foreground pixels. If the sparsity changes from 0.1 to 0.5, then that is just a factor ~6 slow down. – Paul Brodersen Mar 22 '21 at 11:59
  • After your edit: your "n log n" makes sense now. That is also why it becomes slow in a dense setting: the number of foreground elements "n" grows exponentially. By "dense" I mean that 95% of pixels are foreground, divided only by thin border lines. Do you agree then that it becomes an issue or am I missing something? My idea is to "collapse" the large foreground areas first to small areas, which might be possible in my case. – Make42 Mar 22 '21 at 12:04
  • I think you under-appreciate how slowly n log n grows. Let's say you have an array with a million elements. Then the ratio between the execution time for the dense array (95% foreground) and the sparse array (10%) is just 10.6. – Paul Brodersen Mar 22 '21 at 12:09
  • Sorry, I meant a billion. For a million, the factor is 11.4. – Paul Brodersen Mar 22 '21 at 12:13
  • What I mean is that n = y^x, or in your example n = 50^4. So "n * log(n)" might grow slowly for an increase of n, but n itself increases exponentially. So the entire thing grows with "y^x * log(y^x)" and that is the problem. – Make42 Mar 22 '21 at 12:14
  • Lol, sure, if the arrays have galactic proportions, any execution time starts to become long. The point is: finding neighbours can maybe be done quicker, e.g. if you precomputed all possible neighbourhoods around a pixel and then created a hash table that maps neighbourhoods to edges. However, you will not get around the n log n requirement for finding the components in the graph / set of sets / whatever you want to use as a data structure (if you do, let me know and then we write a paper that might net us a Fields medal). So ultimately, your algorithm will always be n log n at best. – Paul Brodersen Mar 22 '21 at 12:22
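The "precompute the neighbourhood offsets" idea from the last comment can be sketched as a hash-map variant of the answer's approach, trading the KD-tree for a coordinate dictionary (the function name is mine, and this is an illustration, not benchmarked against the answer's code):

```python
import numpy as np
import networkx as nx


def get_components_hashed(boolean_array):
    # map each foreground coordinate to its index once
    coords = list(zip(*np.where(boolean_array)))
    index = {c: i for i, c in enumerate(coords)}

    # axis-aligned unit offsets, i.e. what would be 4-connectivity in 2D:
    # one +1 and one -1 step along each of the D axes
    D = boolean_array.ndim
    offsets = [tuple(int(d == a) * s for d in range(D))
               for a in range(D) for s in (1, -1)]

    G = nx.Graph()
    G.add_nodes_from(range(len(coords)))  # isolated pixels get labels, too
    for i, c in enumerate(coords):
        for off in offsets:
            neighbour = tuple(x + o for x, o in zip(c, off))
            j = index.get(neighbour)  # O(1) lookup instead of a tree query
            if j is not None:
                G.add_edge(i, j)

    output = np.zeros(boolean_array.shape, dtype=int)
    for label, component in enumerate(nx.connected_components(G), start=1):
        for i in component:
            output[coords[i]] = label
    return output
```

The component-finding step is unchanged, so the n log n bound discussed above still applies; only the neighbour search is replaced by hashing.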