7

Say I've labeled an image with scipy.ndimage.measurements.label like so:

[[0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 3, 0],
 [2, 2, 0, 0, 0, 0],
 [2, 2, 0, 0, 0, 0]]

What's a fast way to collect the coordinates belonging to each label? I.e. something like:

{ 1: [[0, 1], [1, 1], [2, 1]],
  2: [[4, 0], [4, 1], [5, 0], [5, 1]],
  3: [[3, 4]] }

I'm working with images that are ~15,000 x 5000 pixels in size, and roughly half of each image's pixels are labeled (i.e. non-zero).

Rather than iterating through the entire image with nditer, would it be faster to do something like np.where(img == label) for each label?

EDIT:

Which algorithm is fastest depends on how big the labeled image is compared to how many labels it has. Warren Weckesser's and Salvador Dali / BHAT IRSHAD's methods (which are based on np.nonzero and np.where) all seem to scale linearly with the number of labels, whereas iterating through each image element with nditer obviously scales linearly with the size of the labeled image.
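
For reference, "iterating with nditer" here means a gather along the lines of the sketch below (a reconstruction, not the exact benchmark code):

from collections import defaultdict
import numpy as np

def nditer_based(img):
    # Walk every pixel once and record the multi-index of each labeled pixel.
    res = defaultdict(list)
    it = np.nditer(img, flags=['multi_index'])
    for val in it:
        label = int(val)
        if label:
            res[label].append(it.multi_index)
    return dict(res)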

The results of a small test:

size: 1000 x 1000, num_labels: 10
weckesser ... 0.214357852936s 
dali ... 0.650229930878s 
nditer ... 6.53645992279s 


size: 1000 x 1000, num_labels: 100
weckesser ... 0.936990022659s 
dali ... 1.33582305908s 
nditer ... 6.81486487389s 


size: 1000 x 1000, num_labels: 1000
weckesser ... 8.43906402588s 
dali ... 9.81333303452s 
nditer ... 7.47897100449s 


size: 1000 x 1000, num_labels: 10000
weckesser ... 100.405524015s 
dali ... 118.17239809s 
nditer ... 9.14583897591s

So the question becomes more specific:

For labeled images in which the number of labels is on the order of sqrt(size(image)), is there an algorithm to gather label coordinates that is faster than iterating through every image element (i.e. with nditer)?

bennlich
  • `np.where(img == label)` (or equivalently `np.nonzero(img == label)`) will be faster than iterating through with `nditer`. However, do you actually need the indices? If you want to extract values of some other array, `data[img == label]` will be faster than `i, j = np.where(img == label); data[i, j]`. – Joe Kington Sep 23 '15 at 21:19
  • @JoeKington Turns out `np.where` will not always be faster than iterating. I just made some edits to my question that reflect this. – bennlich Sep 28 '15 at 15:41
  • Did you eliminate the use of `np.unique` from Salvador Dali and BHAT IRSHAD's answers? – Warren Weckesser Sep 28 '15 at 17:21
  • @WarrenWeckesser yes – bennlich Oct 05 '15 at 23:17
  • Related: [faster alternative to numpy.where?](https://stackoverflow.com/q/33281957/7851470) – Georgy May 25 '20 at 14:38

4 Answers

4

Here's a possibility:

import numpy as np

a = np.array([[0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 3, 0],
              [2, 2, 0, 0, 0, 0],
              [2, 2, 0, 0, 0, 0]])

# If the array was computed using scipy.ndimage.measurements.label, you
# already know how many labels there are.
num_labels = 3

nz = np.nonzero(a)
coords = np.column_stack(nz)
nzvals = a[nz[0], nz[1]]
res = {k:coords[nzvals == k] for k in range(1, num_labels + 1)}

I called this script get_label_indices.py. Here's a sample run:

In [97]: import pprint

In [98]: run get_label_indices.py

In [99]: pprint.pprint(res)
{1: array([[0, 1],
       [1, 1],
       [2, 1]]),
 2: array([[4, 0],
       [4, 1],
       [5, 0],
       [5, 1]]),
 3: array([[3, 4]])}
Warren Weckesser
  • I think your method is the fastest of the proposed solutions, but I haven't done thorough testing yet. Surprisingly, for large numbers of labels (~30,000), it seems like iterating through each image pixel (~75,000,000 of them) ends up being faster. When I do some thorough comparisons I'll update my question with what I find. – bennlich Sep 24 '15 at 17:16
  • Just got around to doing a more detailed comparison. Lemme know what you think. – bennlich Sep 28 '15 at 15:42
  • @bennlich Nice comparison. I suspected the performance would depend strongly on the density of the labels and on the number of unique labels. – Warren Weckesser Sep 28 '15 at 17:17
2

You can do something like this (assuming img is your original ndarray):

res = {}
for i in np.unique(img)[1:]:
    x, y = np.where(img == i)
    res[i] = list(zip(x, y))

which will give you what you want:

{
 1: [(0, 1), (1, 1), (2, 1)],
 2: [(4, 0), (4, 1), (5, 0), (5, 1)],
 3: [(3, 4)]
}

Whether it will be faster is for the benchmark to determine.

Per Warren's suggestion, I do not need to use unique and can just do:

res = {}
for i in range(1, num_labels + 1):
    x, y = np.where(img == i)
    res[i] = list(zip(x, y))
Salvador Dali
  • Because `img` was computed using `scipy.ndimage.measurements.label()`, it is not necessary to call `np.unique(img)` to get its unique values. We already know the values are 0, 1, 2, ..., `num_labels`. `num_labels` is also known, because it is the second return value of `label()`. So you could replace `np.unique(img)[1:]` with `range(1, num_labels+1)`. – Warren Weckesser Sep 24 '15 at 01:59
0

Try this:

>>> z
array([[0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 3, 0],
       [2, 2, 0, 0, 0, 0],
       [2, 2, 0, 0, 0, 0]])
>>> {i: list(zip(*np.where(z == i))) for i in np.unique(z) if i}
{1: [(0, 1), (1, 1), (2, 1)], 2: [(4, 0), (4, 1), (5, 0), (5, 1)], 3: [(3, 4)]}
Irshad Bhat
0

This is basically an argsort operation, with some additional work to get the desired format:

def sorting_based(img, nlabels):
    img_flat = img.ravel()

    # Number of pixels carrying each label; index 0 is the background count.
    label_counts = np.bincount(img_flat)

    # Linear indices of all pixels sorted by label, with the background block dropped.
    lin_idx = np.argsort(img_flat)[label_counts[0]:]

    # Convert the linear indices back to (row, col) coordinates.
    coor = np.column_stack(np.unravel_index(lin_idx, img.shape))

    # Split points between the blocks of consecutive labels 1..nlabels.
    ptr = np.cumsum(label_counts[1:-1])
    out = dict(enumerate(np.split(coor, ptr), start=1))

    return out

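For a quick sanity check on the example array from the question (a usage sketch of my own; the order of coordinates within each label depends on the sort kind and is not guaranteed to be row-major):

a = np.array([[0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 3, 0],
              [2, 2, 0, 0, 0, 0],
              [2, 2, 0, 0, 0, 0]])

res = sorting_based(a, nlabels=3)
# res[1], res[2] and res[3] hold the coordinate arrays for labels 1, 2 and 3.
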
As you found out, doing np.where(img == label) for each label results in O(m*n) runtime, with m = n_pixels and n = n_labels. The sorting-based approach reduces the complexity to O(m*log(m) + n).

It is possible to do this operation in linear time, but I don't think it can be vectorized with Numpy. You could abuse scipy.sparse.csr_matrix in a way similar to this answer, but at that point you're probably better off writing code that actually makes sense, in Numba, Cython, etc.
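
As a rough illustration of the linear-time idea (my own sketch, not the code from the linked answer), a two-pass gather in Numba could look like this; it assumes labels run from 0 (background) to nlabels:

import numpy as np
import numba

@numba.njit
def gather_coords(img, nlabels):
    h, w = img.shape
    # First pass: count how many pixels carry each label (index 0 = background).
    counts = np.zeros(nlabels + 1, dtype=np.int64)
    for i in range(h):
        for j in range(w):
            counts[img[i, j]] += 1
    # starts[k] is where label k's block begins in the packed coords array.
    starts = np.zeros(nlabels + 2, dtype=np.int64)
    for k in range(2, nlabels + 2):
        starts[k] = starts[k - 1] + counts[k - 1]
    # Second pass: write each labeled pixel's (row, col) into its label's block.
    coords = np.empty((h * w - counts[0], 2), dtype=np.int64)
    fill = starts.copy()
    for i in range(h):
        for j in range(w):
            lab = img[i, j]
            if lab:
                coords[fill[lab], 0] = i
                coords[fill[lab], 1] = j
                fill[lab] += 1
    return coords, starts

# coords[starts[k]:starts[k + 1]] are the coordinates of label k.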

Community