How to use python OpenCV to find largest connected component in a single channel image that matches a specific value?

Question

So I have a single channel image that is mostly 0s (background), and some values for foreground pixels like 20, 21, 22. The nonzero foreground pixels are mostly clustered together with other foreground pixels with the same value. However, there is some noise in the image. To get rid of the noise, I want to use connected components analysis, and for each value (in this case 20, 21, 22), zero out everything but the largest connected component. So in the end, I will have 3 large connected components and no noise. How would I use cv2.connectedComponentsWithStats to accomplish this? It seems poorly documented and even after looking at this post, I don't fully understand how to parse the return value of the function. Is there a way to specify to the function that I only want connected components matching a specific greyscale value?

What about just masking out the given intensity and running the analysis on that? — Dan Mašek, Nov 28 '17 at 00:07
So you mean that there might be many distinct regions with values of 20, but you only want the largest of them for each value? — alkasm, Nov 28 '17 at 00:07

alkasm · Accepted Answer · 2017-11-28T00:57:40.913

Here's the general approach:

1. Create a new blank image to add the components into
2. Loop through each distinct non-zero value in your image
3. Create a mask for each value (giving the multiple blobs per value)
4. Run connectedComponentsWithStats() on the mask
5. Find the non-zero label corresponding to the largest area
6. Create a mask with the largest label and insert the value into the new image at the masked positions

The annoying part is step 5: label 0 (the background of the mask) will usually, but not always, be the largest component. So we need the largest non-zero label by area.

Here's some code which I think achieves everything (some sample images would be nice to be sure):

import cv2
import numpy as np

img = np.array([
    [1, 0, 1, 1, 2],
    [1, 0, 1, 1, 2],
    [1, 0, 1, 1, 2],
    [1, 0, 1, 1, 2],
    [1, 0, 1, 1, 2]], dtype=np.uint8)

new_img = np.zeros_like(img)                                                     # step 1
for val in np.unique(img)[1:]:                                                   # step 2
    mask = np.uint8(img == val)                                                  # step 3
    labels, stats = cv2.connectedComponentsWithStats(mask, connectivity=4)[1:3]  # step 4
    largest_label = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])                   # step 5
    new_img[labels == largest_label] = val                                       # step 6

print(new_img)

Showing the desired output:

[[0 0 1 1 2]
 [0 0 1 1 2]
 [0 0 1 1 2]
 [0 0 1 1 2]
 [0 0 1 1 2]]

To go through the code, first we create the new image, unimaginatively called new_img, filled with zeros to be populated later with the correct values. Then, np.unique() finds the unique values in the image, and I'm taking everything except the first value; note that np.unique() returns a sorted array, so 0 will be the first value (as long as the image contains background zeros) and we don't need to find components of zero. For each unique val, create a mask populated with 0s and 1s, and run connected components on this mask. This will label each distinct region with a different label. Then we can grab the largest non-zero labeled component**, create a mask for it, and add that val into the new image at that place.

** This is the annoying bit that looks weird in the code.

largest_label = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])

First, you can check out the answer you linked for the shape of the stats array, but each row corresponds to a label (so the label 0 will correspond to the first row, etc), and the column is defined by the integer cv2.CC_STAT_AREA (which is just 4). We'll need to make sure we're looking at the largest non-zero label, so I'm only looking at rows past the first one. Then, grab the index corresponding to the largest area. Since we shaved the zero row off, the index now corresponds to label-1, so add 1 to get the correct label. Then we can mask as usual and insert the value.

Thanks for the thorough explanation. I really appreciate it. I still had one question. In this case, shouldn't stats always have just 2 rows, since the mask only has 2 labels (0 and val)? in that case, is there a problem with just accessing row 1 directly using `[1]` rather than `[1:]`? Or am I misunderstanding the use of the term "label" with regards to the connected components stats? — Terry Martin, Nov 28 '17 at 15:04
@Terry Martin you're misunderstanding the label. Connected components labels each component with a...label. If the blobs are separate, they are different components. So even for a 3x3 image with 1s on the left and right and 0s in the middle, the connected components of that image would have 1s on the left, 0s in the middle, and 2s on the right. Each connected component gets a new label, and that's how components of the same color are distinguished. — alkasm, Nov 28 '17 at 15:16
