
I'm building a web app to help students learn maths.

The app needs to display maths content that comes from LaTeX files. These LaTeX files render (beautifully) to PDF, which I can convert cleanly to SVG thanks to pdf2svg.

The resulting image (SVG, PNG, or whatever image format) looks something like this:

 _______________________________________
|                                       |
| 1. Word1 word2 word3 word4            |
|    a. Word5 word6 word7               |
|                                       |
|   ///////////Graph1///////////        |
|                                       |
|    b. Word8 word9 word10              |
|                                       |
| 2. Word11 word12 word13 word14        |
|                                       |
|_______________________________________|

Real example:

[example image]
The intent of the web app is to manipulate this content and add to it, leading to something like this:

 _______________________________________
|                                       |
| 1. Word1 word2                        | <-- New line break
|_______________________________________|
|                                       |
| -> NewContent1                        |  
|_______________________________________|
|                                       |
|   word3 word4                         |  
|_______________________________________|
|                                       |
| -> NewContent2                        |  
|_______________________________________|
|                                       |
|    a. Word5 word6 word7               |
|_______________________________________|
|                                       |
|   ///////////Graph1///////////        |
|_______________________________________|
|                                       |
| -> NewContent3                        |  
|_______________________________________|
|                                       |
|    b. Word8 word9 word10              |
|_______________________________________|
|                                       |
| 2. Word11 word12 word13 word14        |
|_______________________________________|

Example:

[example image]
A single large image doesn't give me the flexibility to do this kind of manipulation.

But if the image file were broken down into smaller files, each holding a single word or a single graph, I could do these manipulations.

What I think I need to do is detect the whitespace in the image and slice it into multiple sub-images, looking something like this:

 _______________________________________
|          |       |       |            |
| 1. Word1 | word2 | word3 | word4      |
|__________|_______|_______|____________|
|             |       |                 |
|    a. Word5 | word6 | word7           |
|_____________|_______|_________________|
|                                       |
|   ///////////Graph1///////////        |
|_______________________________________|
|             |       |                 |
|    b. Word8 | word9 | word10          |
|_____________|_______|_________________|
|           |        |        |         |
| 2. Word11 | word12 | word13 | word14  |
|___________|________|________|_________|

I'm looking for a way to do this. What do you think is the way to go?

Thank you for your help!

  • Vertical and horizontal projection. First segment the whole image into rows, then each row into columns. – Dan Mašek Aug 19 '17 at 14:58
  • Thank you Dan. I get the idea. What tool would you use for vertical and horizontal projection? Can it be automated? Can it detect rows and columns? – lami Aug 19 '17 at 15:04
  • What you do is basically calculate the average intensity per row (e.g. using `cv2.reduce`). Use that to identify the white gaps between rows. Find the midpoints of the gaps. Use those as cut-points to generate a set of images, one per line of text/graph. Now repeat the same thing per column. – Dan Mašek Aug 19 '17 at 15:12

2 Answers


I would use horizontal and vertical projection to first segment the image into lines, and then each line into smaller slices (e.g. words).

Start by converting the image to grayscale, and then invert it, so that gaps contain zeros and any text/graphics are non-zero.

img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray

Calculate the horizontal projection -- the mean intensity per row -- using `cv2.reduce`, and flatten it to a linear array.

row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()

Now find the row ranges for all the contiguous gaps. You can use the function provided in this answer.

row_gaps = zero_runs(row_means)
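For reference, the `zero_runs` helper from the linked answer (also included in the full script below) is short; here it is with a quick check on a toy array:

```python
import numpy as np

# From https://stackoverflow.com/a/24892274/3962537
def zero_runs(a):
    # 1 where a is 0, padded with an extra 0 at each end
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Runs of zeros start and end where absdiff is 1
    return np.where(absdiff == 1)[0].reshape(-1, 2)

print(zero_runs(np.array([1, 2, 0, 0, 0, 3, 0, 0])))  # [[2 5] [6 8]]
```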

Finally, calculate the midpoints of the gaps, which we will use to cut the image up.

row_cutpoints = (row_gaps[:,0] + row_gaps[:,1] - 1) // 2  # integer division, so the cutpoints can be used as indices

You end up with something like this situation (gaps are pink, cutpoints red):

Visualization of horizontal projection, gaps and cutpoints


The next step is to process each identified line.

bounding_boxes = []
for n,(start,end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]

Calculate the vertical projection (average intensity per column), find the gaps and cutpoints. Additionally, calculate gap sizes, to allow filtering out the small gaps between individual letters.

column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
column_gaps = zero_runs(column_means)
column_gap_sizes = column_gaps[:,1] - column_gaps[:,0]
column_cutpoints = (column_gaps[:,0] + column_gaps[:,1] - 1) // 2

Filter the cutpoints.

filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]

And create a list of bounding boxes for each segment.

for xstart,xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
    bounding_boxes.append(((xstart, start), (xend, end)))

Now you end up with something like this (again gaps are pink, cutpoints red):

Visualization of vertical projection, gaps and cutpoints


Now you can cut up the image. I'll just visualize the bounding boxes found:

Visualization of bounding boxes


The full script:

import cv2
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec


def plot_horizontal_projection(file_name, img, projection):
    fig = plt.figure(1, figsize=(12,16))
    gs = gridspec.GridSpec(1, 2, width_ratios=[3,1])

    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)

    ax = plt.subplot(gs[1])
    ax.plot(projection, np.arange(img.shape[0]), 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([0.0, 255.0])
    plt.ylim([-0.5, img.shape[0] - 0.5])
    ax.invert_yaxis()

    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])  

    fig.set_dpi(200)

    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf() 

def plot_vertical_projection(file_name, img, projection):
    fig = plt.figure(2, figsize=(12, 4))
    gs = gridspec.GridSpec(2, 1, height_ratios=[1,5])

    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)

    ax = plt.subplot(gs[1])
    ax.plot(np.arange(img.shape[1]), projection, 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([-0.5, img.shape[1] - 0.5])
    plt.ylim([0.0, 255.0])

    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])  

    fig.set_dpi(200)

    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf() 

def visualize_hp(file_name, img, row_means, row_cutpoints):
    row_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    row_highlight[row_means == 0, :, :] = [255,191,191]
    row_highlight[row_cutpoints, :, :] = [255,0,0]
    plot_horizontal_projection(file_name, row_highlight, row_means)

def visualize_vp(file_name, img, column_means, column_cutpoints):
    col_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    col_highlight[:, column_means == 0, :] = [255,191,191]
    col_highlight[:, column_cutpoints, :] = [255,0,0]
    plot_vertical_projection(file_name, col_highlight, column_means)


# From https://stackoverflow.com/a/24892274/3962537
def zero_runs(a):
    # Create an array that is 1 where a is 0, and pad each end with an extra 0.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Runs start and end where absdiff is 1.
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges


img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray

row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
row_gaps = zero_runs(row_means)
row_cutpoints = (row_gaps[:,0] + row_gaps[:,1] - 1) // 2  # integer division for indexing

visualize_hp("article_hp.png", img, row_means, row_cutpoints)

bounding_boxes = []
for n,(start,end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]

    column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
    column_gaps = zero_runs(column_means)
    column_gap_sizes = column_gaps[:,1] - column_gaps[:,0]
    column_cutpoints = (column_gaps[:,0] + column_gaps[:,1] - 1) // 2

    filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]

    for xstart,xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
        bounding_boxes.append(((xstart, start), (xend, end)))

    visualize_vp("article_vp_%02d.png" % n, line, column_means, filtered_cutpoints)

result = img.copy()

for bounding_box in bounding_boxes:
    # cast to plain ints, since cv2.rectangle expects Python integer coordinates
    cv2.rectangle(result, tuple(map(int, bounding_box[0])), tuple(map(int, bounding_box[1])), (255,0,0), 2)

cv2.imwrite("article_boxes.png", result)
  • Thank you Dan this is more than I could even expect! – lami Aug 19 '17 at 20:09
  • OpenCV cannot load and write .svg files if I understand correctly? It would allow perfect display at any scale. Is there any vectorial image format that OpenCV handles? – lami Aug 24 '17 at 09:36
  • As far as I can tell, [it can't](https://github.com/opencv/opencv/tree/master/modules/imgcodecs/src). When you think about it, unless you render it, it won't be a raster image, so the approach would likely need to be different. (TBH, I'd need to do some research to give you a good answer to that) Although one possibility comes to mind, but it's just a quick thought -- render and find the bounding boxes using the current approach, then use the coordinates to find the corresponding pieces of the SVG. – Dan Mašek Aug 24 '17 at 13:34
  • That makes a lot of sense. I'm going to look into this direction (detect bounding boxes with opencv and slice svg). I can't thank you enough! – lami Aug 24 '17 at 14:25
  • How can you find `zero_runs` if `dtype=cv2.CV_32F`? It only works if the white is perfect white, with absolutely zero noise, right? – Ciprian Tomoiagă Oct 17 '17 at 11:21
  • @CiprianTomoiaga Yes. In this case that's sufficient, since the input images are computer generated (and therefore don't contain any noise). – Dan Mašek Oct 17 '17 at 12:32
  • @DanMašek Fantastic solution, insightful and usable in other ways. For example, kindly look at [image](https://i.imgur.com/iANzOYR.png) having 4/5 pixels thick lines. I want to group lines by thickness and find their XY position. I tweaked your solution in a nested way. Horizontal lines: compute row projections, find a set of 4/5 continuous rows (line height) having equal intensity, then for this set find some number of continuous columns (line width) having exactly the same intensity. The run of rows and columns will give XY positions. Can you tell me if this approach is good, or does any built-in function exist? – SKR Nov 16 '18 at 20:12
  • @SKR If it meets your requirements, and gives good results, then I'd call it good :) There might be other approaches, but that's probably better for a new question. I'm not really aware of any built-in function of OpenCV that would do it all at once. – Dan Mašek Nov 17 '18 at 19:11

The image is top quality: perfectly clean, not skewed, well separated characters. A dream!

First perform binarization and blob detection (both standard in OpenCV).

Then cluster the characters by grouping those with an overlap in the ordinates (i.e. facing each other in a row). This will naturally isolate the individual lines.
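That row clustering could be sketched as follows, with `group_into_rows` as a hypothetical helper over (x, y, w, h) blob boxes:

```python
def group_into_rows(boxes):
    """Cluster (x, y, w, h) boxes into rows: a box joins a row if its
    vertical extent [y, y+h) overlaps the row's running extent."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        x, y, w, h = box
        for row in rows:
            if y < row['y1'] and y + h > row['y0']:   # vertical overlap
                row['boxes'].append(box)
                row['y0'] = min(row['y0'], y)
                row['y1'] = max(row['y1'], y + h)
                break
        else:
            rows.append({'y0': y, 'y1': y + h, 'boxes': [box]})
    return [r['boxes'] for r in rows]

# Two lines of three "characters" each, with slightly jittered baselines
boxes = [(0, 0, 5, 10), (7, 2, 5, 10), (14, 1, 5, 9),
         (0, 20, 5, 10), (7, 21, 5, 9), (14, 20, 5, 10)]
print([len(r) for r in group_into_rows(boxes)])  # [3, 3]
```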

Now in every row, sort the blobs left-to-right and cluster by proximity to isolate the words. This will be a delicate step, because the spacing of characters within a word is close to the spacing between distinct words. Don't expect perfect results. This should work better than a projection.
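The word grouping might look like this; the gap threshold is the delicate part, and the value below is only a guess to be tuned for the actual font:

```python
def group_into_words(row_boxes, gap_thresh=6):
    """Split a row of (x, y, w, h) character boxes, sorted left to right,
    wherever the horizontal gap exceeds gap_thresh pixels.
    gap_thresh is a made-up value; tune it for the real rendering."""
    row_boxes = sorted(row_boxes, key=lambda b: b[0])
    words = [[row_boxes[0]]]
    for prev, cur in zip(row_boxes, row_boxes[1:]):
        gap = cur[0] - (prev[0] + prev[2])  # space between adjacent boxes
        if gap > gap_thresh:
            words.append([cur])       # large gap: start a new word
        else:
            words[-1].append(cur)     # small gap: same word
    return words

# Three tight characters, a wide gap, then two more characters
row = [(0, 0, 5, 10), (7, 0, 5, 10), (14, 0, 5, 10),
       (30, 0, 5, 10), (37, 0, 5, 10)]
print([len(w) for w in group_into_words(row)])  # [3, 2]
```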

The situation is worse with italics, as the horizontal spacing is even narrower. You may also have to look at the "slanted distance", i.e. find the lines tangent to the characters in the direction of the italics. This can be achieved by applying a reverse shear transform.


Thanks to the grid, the graphs will appear as big blobs.