2

In the example image (just a reference, my images will be of same pattern) a page which have full horizontal text and other have two horizontal column of text.

enter image description here

How to automatically detect the pattern of the document and read one after the other column of data in python?.

I am using Tesseract OCR with Psm 6, where it is reading horizontally which is wrong.

James Z
  • 12,209
  • 10
  • 24
  • 44

1 Answers1

4

One way to accomplish this is using morphological operations and contour detection.

With the former you essentially "bleed" all characters into a big chunky blob. With the latter, you locate these blobs in your image and extract the ones that seem interesting (meaning: big enough).extracted contours

Script used:

import cv2
import sys

SCALE = 4
AREA_THRESHOLD = 427505.0 / 2

def show_scaled(name, img):
    try:
        h, w  = img.shape
    except ValueError:
        h, w, _  = img.shape
    cv2.imshow(name, cv2.resize(img, (w // SCALE, h // SCALE)))

def main():
    img = cv2.imread(sys.argv[1])
    img = img[10:-10, 10:-10] # remove the border, it confuses contour detection
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    show_scaled("original", gray)

    # black and white, and inverted, because
    # white pixels are treated as objects in
    # contour detection
    thresholded = cv2.adaptiveThreshold(
                gray, 255,
                cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV,
                25,
                15
            )
    show_scaled('thresholded', thresholded)
    # I use a kernel that is wide enough to connect characters
    # but not text blocks, and tall enough to connect lines.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (13, 33))
    closing = cv2.morphologyEx(thresholded, cv2.MORPH_CLOSE, kernel)

    im2, contours, hierarchy = cv2.findContours(closing, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    show_scaled("closing", closing)

    for contour in contours:
        convex_contour = cv2.convexHull(contour)
        area = cv2.contourArea(convex_contour)
        if area > AREA_THRESHOLD:
            cv2.drawContours(img, [convex_contour], -1, (255,0,0), 3)

    show_scaled("contours", img)
    cv2.imwrite("/tmp/contours.png", img)
    cv2.waitKey()

if __name__ == '__main__':
    main()

Then all you need is to compute the bounding box of the contour, and cut it from the original image. Add a bit of a margin and feed the whole thing to tesseract.

deets
  • 6,285
  • 29
  • 28
  • How to deal this problem: the two columns are not surrounded by two contours respectively but as a whole by only one? – Αλέξανδρος Ζεγγ May 04 '18 at 03:30
  • Try making the structuring element thinner. – deets May 04 '18 at 07:35
  • @deets, the script is working great, but how do i crop the text boxes using python and create new images from the boxes? – Dimitar May 20 '19 at 05:54
  • That’s really basic OpenCV and numpy stuff. Read through a few tutorials. – deets May 23 '19 at 16:29
  • can someone update this code to work with opencv v4 and above, please – Dimitar Oct 06 '19 at 07:23
  • @Dimitar nothing in this code is especially version dependent. What parts of it give you trouble, and what have you researched on the documentation of those calls? – deets Oct 06 '19 at 16:42
  • 1
    @deets, this line of code `im2, contours, hierarchy = cv2.findContours(closing, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)` its getting me error in opencv v4 – Dimitar Oct 07 '19 at 12:59
  • Have you looked at the documentation of that method, and how it has changed between versions? Have you looked at the documentation of *older* versions (3 or 2) how that method *used* to behave, to get an understanding of how the behaviour has changed, and what to do about that? – deets Oct 07 '19 at 14:42
  • To get past the error in OpenCV 4, remove `im2, ` from the beginning of the line that calls `findContours`. – toohool May 02 '20 at 03:46
  • @deets how do u fixed AREA_THRESHOLD , what any geeral formula for keeping area threshold – k_p May 26 '22 at 10:37
  • @k_p based on the actual data. That was just what the concrete example suggested. So no, there is no general formula. At least not right here, right now. You could of course create one based on scan resolution, and the concrete layout used, including margins etc, but that's a lot of effort, and not worth it IMHO. – deets May 26 '22 at 10:43
  • and in this case after getting contour how you are appying OCR , can u add that part for more clearity – k_p May 26 '22 at 11:04
  • I won't change this recipe, as that's not the question that was asked. *If* I find time, I'll look into your other question. – deets May 27 '22 at 11:54