I am trying to identify paragraphs of text in a .pdf document by first converting it into an image and then using OpenCV. However, I am getting bounding boxes around individual lines of text instead of whole paragraphs. How can I set a threshold or some other limit to get paragraphs instead of lines?

Here is the sample input image:

[image: input]

Here is the output I am getting for the above sample:

[image: output]

I want a single bounding box around the paragraph in the middle. This is the code I am using:

import cv2
import numpy as np

large = cv2.imread('sample image.png')
rgb = cv2.pyrDown(large)
small = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)

# kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
kernel = np.ones((5, 5), np.uint8)
grad = cv2.morphologyEx(small, cv2.MORPH_GRADIENT, kernel)

_, bw = cv2.threshold(grad, 0.0, 255.0, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 1))
connected = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, kernel)

# using RETR_EXTERNAL instead of RETR_CCOMP
contours, hierarchy = cv2.findContours(connected.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
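# OpenCV 3.x returns three values (image, contours, hierarchy), while 2.x and 4.x return two.
# On OpenCV 3.x, comment out the previous line and uncomment the following line.
#_, contours, hierarchy = cv2.findContours(connected.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)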

mask = np.zeros(bw.shape, dtype=np.uint8)

for idx in range(len(contours)):
    x, y, w, h = cv2.boundingRect(contours[idx])
    mask[y:y+h, x:x+w] = 0
    cv2.drawContours(mask, contours, idx, (255, 255, 255), -1)
    r = float(cv2.countNonZero(mask[y:y+h, x:x+w])) / (w * h)

    if r > 0.45 and w > 8 and h > 8:
        cv2.rectangle(rgb, (x, y), (x+w-1, y+h-1), (0, 255, 0), 2)


cv2.imshow('rects', rgb)
cv2.waitKey(0)
nathancy
Achal Gambhir
  • Can you provide the sample image to debug as well? – LazyCoder Jul 29 '19 at 07:49
  • Possible duplicate of [Easy ways to detect and crop blocks (paragraphs) of text out of image?](https://stackoverflow.com/questions/42174563/easy-ways-to-detect-and-crop-blocks-paragraphs-of-text-out-of-image) – LazyCoder Jul 29 '19 at 07:52
  • One procedure could be to cluster the bounding boxes based on adjacency. Sort the list of bounding boxes by their starting y coordinate. If the difference between the ending y coordinate of one bounding box and the starting y coordinate of the next is less than a certain threshold, you can cluster them as a single paragraph. – Arkistarvh Kltzuonstev Jul 29 '19 at 07:56
  • @LazyCoder The link you provided may help, but since I am not very experienced in C++ or OpenCV, I was not able to follow the answer there. – Achal Gambhir Jul 29 '19 at 09:07
  • @ArkistarvhKltzuonstev's solution is what you need bro. Good luck! – LazyCoder Jul 29 '19 at 09:54
  • Blur the image a bit first, then threshold, so that the lines of text in the middle merge together but do not span the gap between paragraphs. – fmw42 Jul 29 '19 at 19:31
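
The box-clustering idea from the comments can be sketched roughly as follows. This is only an illustration, not code from any of the commenters: the function name `group_boxes_into_paragraphs` and the `max_gap` parameter are hypothetical, and the boxes are assumed to be `(x, y, w, h)` tuples such as those returned by `cv2.boundingRect`.

```python
def group_boxes_into_paragraphs(boxes, max_gap=10):
    """Merge line boxes (x, y, w, h) whose vertical gap is at most max_gap pixels.

    max_gap is an assumed tuning value; it should be larger than the line
    spacing and smaller than the spacing between paragraphs.
    """
    paragraphs = []
    # Process the boxes from top to bottom.
    for x, y, w, h in sorted(boxes, key=lambda b: b[1]):
        if paragraphs:
            px, py, pw, ph = paragraphs[-1]
            # Gap between the bottom of the current paragraph and the top of this line.
            if y - (py + ph) <= max_gap:
                left = min(px, x)
                right = max(px + pw, x + w)
                bottom = max(py + ph, y + h)
                # py stays the top edge because the boxes are sorted by y.
                paragraphs[-1] = (left, py, right - left, bottom - py)
                continue
        paragraphs.append((x, y, w, h))
    return paragraphs


# Three closely spaced lines followed by one line much further down.
line_boxes = [(10, 10, 100, 12), (10, 26, 90, 12), (10, 42, 95, 12), (10, 90, 100, 12)]
print(group_boxes_into_paragraphs(line_boxes, max_gap=10))
```

Note that this sketch only looks at vertical gaps, so a multi-column page would also need a check on horizontal overlap before merging.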

1 Answer

This is a classic use case for dilation. Whenever you want to connect multiple items together, you can dilate them so that adjacent contours merge into a single contour. Here's a simple approach:

  1. Obtain a binary image. Load the image, convert it to grayscale, apply a Gaussian blur, then apply Otsu's threshold.

  2. Connect adjacent words together. Create a rectangular kernel and dilate to merge the individual contours.

  3. Detect paragraphs. Find the contours, obtain the bounding rectangle of each one, and draw it on the image.


Otsu's threshold to obtain a binary image

[image: thresholded binary image]

Here's where the magic happens. We can assume that a paragraph is a group of words that are close together, so we dilate to connect adjacent words.

[image: dilated image]

Result

[image: detected paragraphs outlined with bounding boxes]

import cv2
import numpy as np

# Load image, grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (7,7), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Create rectangular structuring element and dilate
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
dilate = cv2.dilate(thresh, kernel, iterations=4)

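# Find contours and draw rectangle
cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# findContours returns 2 values in OpenCV 2.x/4.x and 3 values in 3.x; pick the contour list either way
cnts = cnts[0] if len(cnts) == 2 else cnts[1]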
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 2)

cv2.imshow('thresh', thresh)
cv2.imshow('dilate', dilate)
cv2.imshow('image', image)
cv2.waitKey()
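
The kernel size and iteration count are tuned to the sample image, so a different scan will likely need different values. As a rough rule of thumb (a sketch, not part of OpenCV), each pass with an odd k×k rectangular kernel anchored at its center grows every blob by (k-1)/2 pixels on each side, so two blobs merge once the combined growth covers the gap between them. The helper names below are hypothetical, and the gap is assumed to be measured in the thresholded image.

```python
import math


def max_bridged_gap(kernel_size, iterations):
    """Largest gap (in pixels) that dilation closes between two blobs.

    Assumes an odd, center-anchored rectangular kernel: each iteration grows
    both facing edges by (kernel_size - 1) / 2 pixels.
    """
    return (kernel_size - 1) * iterations


def iterations_to_bridge(gap_px, kernel_size=5):
    """Smallest iteration count whose dilation closes a gap of gap_px pixels."""
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("expects an odd kernel size >= 3")
    return max(1, math.ceil(gap_px / (kernel_size - 1)))


# A 5x5 kernel with 4 iterations bridges gaps of up to 16 pixels.
print(max_bridged_gap(5, 4))
# Iterations needed to join lines that are 12 pixels apart.
print(iterations_to_bridge(12, kernel_size=5))
```

To merge lines within a paragraph while keeping paragraphs apart, pick values whose bridged gap is at least the line spacing and less than the spacing between paragraphs. Because a square kernel grows blobs horizontally as well, a kernel that is taller than it is wide may work better on multi-column pages.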
nathancy