How to split an image into clean paragraphs in Python/OpenCV?

Question

TL;DR: How can I select a paragraph in an image so that the selection doesn't include the adjacent paragraphs above and below it?

I have a set of scanned images, each a single column of text, such as this one. The images are all black and white, already rotated and denoised, and have their white margins trimmed.

What I want to do is divide each image into paragraphs. My initial idea was to measure the average brightness of each row to find the spaces between lines of text, then select a rectangle starting from such a line, sized to match the indentation, and measure the brightness of that rectangle. But that seems a tad cumbersome.
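A minimal sketch of the row-brightness idea, assuming dark text on a light background. The function name `find_text_rows` and the `blank_level` cutoff are made up for illustration and would need tuning on real scans:

```python
import numpy as np

def find_text_rows(gray, blank_level=250):
    """Return inclusive (first_row, last_row) ranges that contain text.

    gray: 2-D uint8 array, dark text on a light background.
    blank_level: assumed cutoff; a row whose mean brightness is below it
    is treated as containing text.
    """
    row_mean = gray.mean(axis=1)
    is_text = row_mean < blank_level

    # Pad with False on both sides so every run has a start and an end,
    # then look for the positions where the flag flips.
    padded = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts = changes[::2]
    ends = changes[1::2] - 1
    return list(zip(starts.tolist(), ends.tolist()))

# Tiny synthetic page: two dark bands on a white background
page = np.full((12, 20), 255, dtype=np.uint8)
page[2:5, 3:15] = 0
page[8:10, 3:15] = 0
print(find_text_rows(page))
```

Because it relies on fully blank rows between lines, this sketch would merge lines whose rows overlap due to skew.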

Moreover, the lines are sometimes slightly skewed (up to ≈ 10 px vertical difference between the two ends), so adjacent lines sometimes overlap. So I thought of selecting all the letters of a paragraph and using them to outline a block of text. I got this using this method, but I am not sure how to proceed further. Select each letter rectangle starting n pixels from the left, and then try to include every rectangle whose x is at least first_rectangle_x - offset? But what then?
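One possible way to continue from letter rectangles, sketched under assumptions: the boxes are `(x, y, w, h)` tuples (for example from a contour detector), lines are far enough apart that a vertical tolerance can separate them, and a paragraph begins with an indented line. The helper names, `y_tolerance`, and `indent` are all hypothetical:

```python
from statistics import median

def group_boxes_into_lines(boxes, y_tolerance=10):
    """Group (x, y, w, h) letter boxes into text lines by vertical centre."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        cy = box[1] + box[3] / 2
        if lines and abs(cy - lines[-1]["cy"]) <= y_tolerance:
            line = lines[-1]
            line["boxes"].append(box)
            # Keep a running mean of the line's vertical centre
            line["cy"] += (cy - line["cy"]) / len(line["boxes"])
        else:
            lines.append({"cy": cy, "boxes": [box]})
    return [line["boxes"] for line in lines]

def paragraph_start_lines(lines, indent=20):
    """Return indices of lines whose leftmost box is indented past the margin."""
    lefts = [min(b[0] for b in line) for line in lines]
    margin = median(lefts)  # typical left edge of a non-indented line
    return [i for i, left in enumerate(lefts) if left - margin >= indent]
```

Each returned index marks the first line of a paragraph, so the paragraph's rows run from that line's top to just above the next returned line. The median assumes most lines are not indented; on a page of very short paragraphs it would be a poor estimate of the margin.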

Rosa Gronchi · Accepted Answer · 2017-02-21T19:37:44.893

This is specific to the paragraph structure in the attached image. I am not sure whether you need a more general solution, but that would probably require additional work:

import cv2
import numpy as np
import matplotlib.pyplot as plt

image = cv2.imread('paragraphs.png', 0)

# find lines by horizontally blurring the image and thresholding
blur = cv2.blur(image, (91,9))
b_mean = np.mean(blur, axis=1)/256

# hist, bin_edges = np.histogram(b_mean, bins=100)
# threshold = bin_edges[66]
threshold = np.percentile(b_mean, 66)
t = b_mean > threshold
# Get the image row numbers that contain text. Text is dark, so text rows
# are the ones whose blurred mean brightness is NOT above the threshold.
# A text line is a consecutive group of such rows, defined by its first
# and last row numbers.
tix = np.where(~t)[0]
lines = []
start_ix = tix[0]
for ix in range(1, tix.shape[0]):
    if tix[ix] == tix[ix-1] + 1:
        continue
    # identified gap between lines, close previous line and start a new one
    end_ix = tix[ix-1]
    lines.append([start_ix, end_ix])
    start_ix = tix[ix]
end_ix = tix[-1]
lines.append([start_ix, end_ix])

l_starts = []
for line in lines:
    # scan left to right for the first column containing a dark (text) pixel
    max_x = min(500, image.shape[1])
    xx = max_x
    for x in range(max_x):
        col = image[line[0]:line[1] + 1, x]
        if np.min(col) < 64:
            xx = x
            break
    l_starts.append(xx)

median_ls = np.median(l_starts)

paragraphs = []
p_start = lines[0][0]

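for ix in range(1, len(lines)):
    # a line indented well past the median left margin starts a new paragraph
    if l_starts[ix] > median_ls * 2:
        p_end = lines[ix][0] - 10
        paragraphs.append([p_start, p_end])
        p_start = lines[ix][0]
# close the final paragraph, which has no indented line after it
paragraphs.append([p_start, lines[-1][1]])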

p_img = np.array(image)
n_cols = p_img.shape[1]
for paragraph in paragraphs:
    cv2.rectangle(p_img, (5, paragraph[0]), (n_cols - 5, paragraph[1]), (128, 128, 0), 5)

cv2.imwrite('paragraphs_out.png', p_img)
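The code above only draws rectangles on a copy of the page. If separate paragraph images are wanted, one possible follow-up is to slice the array by the row ranges. This is a sketch, not part of the original answer; `crop_paragraphs` and its `pad` parameter are hypothetical names:

```python
import numpy as np

def crop_paragraphs(image, paragraphs, pad=0):
    """Cut one sub-image per [start_row, end_row] pair (end row inclusive).

    pad: assumed number of extra rows to keep above and below each crop.
    """
    height = image.shape[0]
    crops = []
    for start, end in paragraphs:
        top = max(0, start - pad)
        bottom = min(height, end + 1 + pad)
        if bottom > top:  # skip empty or inverted ranges
            crops.append(image[top:bottom, :].copy())
    return crops
```

Each crop could then be saved with `cv2.imwrite`, for example under a numbered filename.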

input / output

Thanks, this works pretty well for most images—there are exceptions: http://imgur.com/a/z0836. So indeed, I will have some tinkering to do, but that's okay :) — MrVocabulary, Feb 21 '17 at 09:12
Could you, however, explain what the first few lines of the code do? I have trouble understanding what you did there with the histogram. — MrVocabulary, Feb 21 '17 at 09:29
Sure, I'll add comments. The histogram was intended for visualization and was left there. You can just use percentile instead — Rosa Gronchi, Feb 21 '17 at 19:19
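For readers comparing the two variants: the commented-out histogram threshold and the percentile threshold are generally different numbers. With 100 bins, `bin_edges[66]` lies 66% of the way between the smallest and largest value, while `np.percentile(b_mean, 66)` is the value that 66% of the rows fall below. A small sketch on a made-up row profile (the numbers are synthetic, not taken from the scans):

```python
import numpy as np

# Synthetic profile: 60 darker "text" rows followed by 40 brighter "gap" rows
b_mean = np.concatenate([np.linspace(0.5, 0.8, 60), np.linspace(0.9, 1.0, 40)])

# Histogram variant: the threshold is a position within the value range
hist, bin_edges = np.histogram(b_mean, bins=100)
edge_threshold = bin_edges[66]

# Percentile variant: the threshold is a position within the sorted rows
pct_threshold = np.percentile(b_mean, 66)

print("bin edge threshold:", edge_threshold)
print("percentile threshold:", pct_threshold)
print("rows above bin edge:", int((b_mean > edge_threshold).sum()))
print("rows above percentile:", int((b_mean > pct_threshold).sum()))
```

The percentile version always marks roughly the top 34% of rows as gaps, regardless of how bright they are, so it implicitly assumes a fixed share of blank rows per page.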
