Detecting comic strip dialogue bubble regions in images

Question

I have a grayscale image of a comic strip page that features several dialogue bubbles (speech balloons, etc.): enclosed areas with a white background and solid black borders that contain text, something like this:

I want to detect these regions and create a mask (binary is ok) that will cover all the inside regions of dialogue bubbles, i.e. something like:

The same image, mask overlaid, to be totally clear:

So, my basic idea of the algorithm was something like:

1. Detect where the text is, planting at least one pixel in every bubble. Dilate these regions somewhat and apply a threshold to get a better starting point. I've done this part:

2. Use a flood fill or some sort of graph traversal, starting from every white pixel detected as a pixel-inside-bubble in step 1, but working on the initial image: flood white pixels (which are supposed to be inside the bubble) and stop at dark pixels (which are supposed to be borders or text).
3. Use some sort of binary_closing operation to remove dark areas (i.e. regions that correspond to text) inside bubbles. This part works OK.

So far, steps 1 and 3 work, but I'm struggling with step 2. I'm currently working with scikit-image, and I don't see any ready-made algorithms like flood fill implemented there. Obviously, I can use something trivial like breadth-first traversal, basically as suggested here, but it's really slow when done in Python. I suspect that intricate morphology stuff like binary_erosion or generate_binary_structure in ndimage or scikit-image could do the job, but I struggle to understand all that morphology terminology and how to implement such a custom flood fill with it (i.e. starting from the step 1 image, working on the original image, and writing the result to a separate output image).

I'm open to any suggestions, including ones in OpenCV, etc.

Since these white backgrounds (inside the text bubbles) are contiguous, have you tried connected components? — Stefan van der Walt, Dec 18 '15 at 18:09
Connected components labelling is what I'd love to use *afterwards*, i.e. on the resulting mask to enumerate specific bubbles. I don't see much point in using it on the original image. — GreyCat, Dec 19 '15 at 15:43
Flood filling and connected component labelling are very closely related for images like these. If the edges around the bubbles are closed, or can be made closed, this should give you a pretty decent first estimate. Especially since you can measure the properties of such regions, e.g. how square they are, etc. — Stefan van der Walt, Dec 20 '15 at 03:11
You can treat the pixels as nodes of a graph, with edges between neighboring white pixels. The boundaries would be nodes with a degree less than 4 (if you use 4-connectivity, or less than 8 if you use 8-connectivity). You would, of course, end up with a few distinct boundaries, which are the speech bubbles and the text, and you would be able to distinguish between them by checking which contains which (e.g. by bounding box inclusion tests) — Amnon, Jan 04 '16 at 09:10
Pillow has an undocumented flood fill function that you can check out. https://github.com/python-pillow/Pillow/blob/master/PIL/ImageDraw.py#L367 — Håken Lid, Jan 23 '16 at 08:34
@HåkenLid Thanks, that might be just what I'm looking for! Will do! — GreyCat, Jan 23 '16 at 21:58
Imho, you won't find an algorithm which detects all occurrences in all images, but you will only get to a certain probability (just imagine a cartoon which shows a person with an open comic book showing another bubble, or simply a sheet of paper on a table). So it might be helpful (or necessary) to provide a sample set for benchmarking solutions. — tfv, Feb 03 '16 at 11:48
@GreyCat Can you please briefly describe how you did Step 1? thank you in advance! — schlodinger, Jan 05 '20 at 01:18
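
The connected-components idea raised in the comments above can be sketched roughly as follows. This is a minimal sketch, not anyone's posted solution: it assumes the step 1 result is available as a boolean array (called `seeds` here, a hypothetical name), that "white" means a gray value above an assumed cut-off of 200, and that the bubble borders are closed. It labels the white regions with `scipy.ndimage.label`, keeps the regions that contain at least one seed pixel, and fills the enclosed lettering with `scipy.ndimage.binary_fill_holes`.

```python
import numpy as np
from scipy import ndimage as ndi


def bubbles_from_seeds(gray, seeds, white_thresh=200):
    """Return a boolean mask of white regions touched by at least one seed.

    gray:         2-D uint8 grayscale page.
    seeds:        2-D boolean array, True where step 1 detected text
                  (hypothetical input, not defined in the thread).
    white_thresh: assumed cut-off between "white" and "dark" pixels.
    """
    white = gray > white_thresh
    # Label 4-connected white regions; a closed bubble interior gets one label.
    labels, _ = ndi.label(white)
    # Labels of the white regions that contain at least one seed pixel.
    hit = np.unique(labels[seeds & white])
    mask = np.isin(labels, hit)
    # Fill the enclosed dark holes (the lettering) inside the kept regions.
    return ndi.binary_fill_holes(mask)
```

Note that if the dilated seeds spill over a bubble border onto the page background, the background region is selected as well, so the seeds may need to be eroded or intersected with the bubble interior first.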

score 2 · Answer 1 · answered Apr 09 '16 at 00:57

Even though your actual question concerns step 2 of your processing pipeline, I would like to suggest another approach that might, imho, be simpler, since you stated that you are open to suggestions.

Using the image from your original step 1 you could create an image without text in the bubbles.

Implemented
Detect edges on the original image with the text removed. This should work well for the speech bubbles, as the bubble edges are pretty distinct.

Edge detection
Finally use the edge image and the initially detected "text locations" in order to find those areas within the edge image that contain text.

Watershed-Segmentation

I am sorry for this very general answer; it is too late here for me to do any actual coding. If the question is still open and you need or want more hints concerning my suggestion, I will elaborate in more detail. In the meantime, you could have a look at the Region based segmentation section in the scikit-image docs.
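
One possible reading of the three steps above, loosely following the pattern of the scikit-image region-based segmentation example, is sketched below. It is a sketch under stated assumptions, not the answerer's actual code: `text_mask` stands for the boolean step 1 output (a hypothetical name), `dark_thresh=100` is an assumed cut-off for "dark" pixels, and `text_mask` is assumed to cover the lettering completely without touching the bubble borders. The Sobel edge image serves as the elevation map, and the watershed grows from two kinds of markers: dark pixels and detected text locations.

```python
import numpy as np
from skimage.filters import sobel
from skimage.segmentation import watershed


def bubbles_by_watershed(gray, text_mask, dark_thresh=100):
    """Return a boolean mask of the regions grown from the text markers.

    gray:        2-D uint8 grayscale page.
    text_mask:   2-D boolean array from step 1 (hypothetical input).
    dark_thresh: assumed cut-off below which a pixel counts as dark.
    """
    # Erase the detected text so bubble interiors become uniformly white.
    no_text = gray.copy()
    no_text[text_mask] = 255
    # Edge strength acts as the elevation map for the watershed.
    elevation = sobel(no_text)
    markers = np.zeros(gray.shape, dtype=np.int32)
    markers[no_text < dark_thresh] = 1  # borders and artwork
    markers[text_mask] = 2              # known to lie inside bubbles
    labels = watershed(elevation, markers)
    return labels == 2
```

Because every dark pixel is a marker of the "not a bubble" class, the bubble label cannot cross a closed border; bubbles without any detected text end up in the other class.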

tfv · Answer 2 · 2016-02-03T20:17:18.383

While your overall task aims further, your actual question is about step 2: how to implement a flood fill on an image in which the text inside the bubbles has already been detected.

Since you did not give source code, I had to create something from scratch that hopefully interfaces well with your output from step 1. For this I just took two fixed coordinates; you would instead take white points close to the centers of the blobs created from the text you extracted in step 1. As soon as you provide proper code, that interface can be adjusted.

I took the liberty of filling all internal holes created by the letters you found. If you do not want this, you can skip the code from line 36 on.

For the solution I have taken ideas from two pieces of code, which I cite in the snippet below. You may find more helpful information there.

Keep us posted on your progress!

import cv2
import numpy as np

# with ideas from:
# http://www.learnopencv.com/filling-holes-in-an-image-using-opencv-python-c/
# http://stackoverflow.com/questions/10316057/filling-holes-inside-a-binary-object
print(cv2.__file__)

# Read image
im_in = cv2.imread("gIEXY.png", cv2.IMREAD_GRAYSCALE)

# Threshold.
# Set values above 200 to 0.
# Set values equal to or below 200 to 255.

th, im_th = cv2.threshold(im_in, 200, 255, cv2.THRESH_BINARY_INV)

# Copy the thresholded image.
im_floodfill = im_th.copy()

# Mask used for flood filling.
# Notice the size needs to be 2 pixels larger than the image.
h, w = im_th.shape[:2]
mask = np.zeros((h+2, w+2), np.uint8)

# Floodfill from points inside balloons
cv2.floodFill(im_floodfill, mask, (80,400), 128)
cv2.floodFill(im_floodfill, mask, (610,90), 128)

# Invert floodfilled image
im_floodfill_inv = cv2.bitwise_not(im_floodfill)

# Combine the two images to get the foreground
im_out = im_th | im_floodfill_inv

# Create binary image from segments with holes
th, im_th2 = cv2.threshold(im_out, 130, 255, cv2.THRESH_BINARY)

# Create contours to fill holes
im_th3 = cv2.bitwise_not(im_th2)
# [-2:] keeps (contours, hierarchy) whether findContours returns 2 or 3 values
contour, hier = cv2.findContours(im_th3, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)[-2:]

for cnt in contour:
    cv2.drawContours(im_th3,[cnt],0,255,-1)

segm = cv2.bitwise_not(im_th3)


# Display image
cv2.imshow("Original", im_in)
cv2.imshow("Segmented", segm)
cv2.waitKey(0)
