
With this code I created some bounding boxes around the characters in the image below:

import csv
import cv2
from pytesseract import pytesseract as pt

# run tesseract; with boxes=True it writes per-character boxes to 'output.box'
pt.run_tesseract('bb.png', 'output', lang=None, boxes=True, config="hocr")

# To read the coordinates
boxes = []
with open('output.box', 'rt') as f:
    reader = csv.reader(f, delimiter=' ')
    for row in reader:
        if len(row) == 6:
            boxes.append(row)

# Draw the bounding boxes. Tesseract's .box coordinates have their origin
# at the bottom-left corner of the image, so the y values are flipped with h - y
img = cv2.imread('bb.png')
h, w, _ = img.shape
for b in boxes:
    img = cv2.rectangle(img, (int(b[1]), h-int(b[2])), (int(b[3]), h-int(b[4])), (0, 255, 0), 2)

cv2.imshow('output', img)
cv2.waitKey(0)

OUTPUT

bb-o1

What I would like to have is this:

bb-o2

The program should draw a perpendicular line on the X axis of the bounding boxes, but only for the first and third text areas; the one in the middle must not be involved in the process.

The goal is this (and if there is another way to achieve it, please explain): once I have these two lines (or, better, groups of coordinates), use a mask to cover these two areas.

bb-o3

Is it possible?
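For reference, once the two y-bands are known, covering them with a mask is straightforward with NumPy. This is only a sketch: the image size and the band coordinates below are illustrative, derived from the box data in the question (line 1 spans roughly y=328..365, line 2 spans y=49..86, bottom-left origin).

```python
import numpy as np

h, w = 400, 900                      # assumed image size, for illustration
mask = np.zeros((h, w), dtype=np.uint8)

# y-bands of the first and third text areas, flipped to image coordinates
# (top-left origin) with h - y, as in the drawing code above
for y1, y2 in [(h - 365, h - 328), (h - 86, h - 49)]:
    mask[y1:y2, :] = 255             # cover the whole width of each band

print(int(mask.sum()) // 255)        # number of masked pixels
```

The mask can then be applied with `cv2.bitwise_and(img, img, mask=mask)` (or its inverse) to keep or hide those two areas.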

Source image:

src

CSV as requested: print(boxes)

[['l', '56', '328', '63', '365', '0'], ['i', '69', '328', '76', '365', '0'], ['n', '81', '328', '104', '354', '0'], ['e', '108', '328', '130', '354', '0'], ['1', '147', '328', '161', '362', '0'], ['m', '102', '193', '151', '227', '0'], ['i', '158', '193', '167', '242', '0'], ['d', '173', '192', '204', '242', '0'], ['d', '209', '192', '240', '242', '0'], ['l', '247', '193', '256', '242', '0'], ['e', '262', '192', '292', '227', '0'], ['t', '310', '192', '331', '235', '0'], ['e', '334', '192', '364', '227', '0'], ['x', '367', '193', '398', '227', '0'], ['t', '399', '192', '420', '235', '0'], ['-', '440', '209', '458', '216', '0'], ['n', '481', '193', '511', '227', '0'], ['o', '516', '192', '548', '227', '0'], ['n', '553', '193', '583', '227', '0'], ['t', '602', '192', '623', '235', '0'], ['o', '626', '192', '658', '227', '0'], ['t', '676', '192', '697', '235', '0'], ['o', '700', '192', '732', '227', '0'], ['u', '737', '192', '767', '227', '0'], ['c', '772', '192', '802', '227', '0'], ['h', '806', '193', '836', '242', '0'], ['l', '597', '49', '604', '86', '0'], ['i', '610', '49', '617', '86', '0'], ['n', '622', '49', '645', '75', '0'], ['e', '649', '49', '671', '75', '0'], ['2', '686', '49', '710', '83', '0']]

EDIT:

To use zindarod's answer, you need tesserocr. Installing it through pip install tesserocr can give you various errors. I found a wheel version of it (after hours of trying to install it and solve errors, see my comment below the answer...): here you can find/download it.

Hope this helps.

lucians
  • I would suggest you cluster the bounding boxes, then get the max y in the line 1 cluster and the min y in the line 2 cluster, and create a rectangle using the two y values and the full width to have the mask. – api55 Nov 13 '17 at 09:20
  • It seems right. Do you know how to do it? Also, I found another keyword for this research: "connected-component labeling". – lucians Nov 13 '17 at 09:25
  • Connected component won't do. This works if all of them are connected somehow. But you can use k-means with their y values and k = 3. Then you will have 3 clusters of letters depending on their y value. kmeans is implemented in opencv – api55 Nov 13 '17 at 09:28
  • I am reading about it right now. But I don't know how (or where) to implement this in my code... Seems simple from the docs, but... how? – lucians Nov 13 '17 at 09:38
  • 1
    after you find the boxes, you have 2 y coordinates for each of them (top and bottom) you can average them, to get 1 y value per letter. This will be an array that you pass to kmeans, then kmeans will label each value (each y from each letter) as 1,2,3 (not sure if it is 0,1,2 though) No you can put each group of letters in a box. From there you can get the values needed to create a mask... I can write a complete answer, but in a few hours. Can you post the csv and the initial image? to be able to test it – api55 Nov 13 '17 at 09:45
  • Thank you very much. Added details as you requested. I'll look more at you answer and at the code to see what I can do. – lucians Nov 13 '17 at 09:51
  • 1
    Have a look [here](https://stackoverflow.com/q/34981144/5008845) – Miki Nov 13 '17 at 14:24
  • Thanks. I saw that question 2 days ago but I didn't understand much apart from what you are literally explaining. Can this be applied to my question? Also, I was looking for a solution using @api55's hint. I have a .box file generated by the code above (or the print(boxes) output). I discovered that the first and last values are the X and Y coordinates. I still don't know what the two in the middle mean... – lucians Nov 13 '17 at 14:30
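The clustering idea from the comments can be sketched even without cv2.kmeans: since the letters fall into three horizontal bands, a simple gap-based grouping of the boxes' vertical midpoints works for this data. Only a subset of the .box rows is shown, and the 50-pixel gap threshold is an assumption tuned to the line spacing in the source image.

```python
# A subset of the .box rows from the question: (char, x1, y1, x2, y2),
# with the origin at the bottom-left corner of the image.
boxes = [('l', 56, 328, 63, 365), ('1', 147, 328, 161, 362),    # "line 1"
         ('m', 102, 193, 151, 227), ('h', 806, 193, 836, 242),  # middle text
         ('l', 597, 49, 604, 86), ('2', 686, 49, 710, 83)]      # "line 2"

# One y value per letter: the vertical midpoint of its box.
mids = sorted((b[2] + b[4]) / 2 for b in boxes)

# Gap-based 1-D clustering: start a new text line whenever the jump
# between consecutive midpoints exceeds the assumed line spacing.
lines, current = [], [mids[0]]
for y in mids[1:]:
    if y - current[-1] > 50:
        lines.append(current)
        current = [y]
    else:
        current.append(y)
lines.append(current)

print(len(lines))  # three clusters: line 2, middle text, line 1
```

From the first and third clusters you can take min/max y values to place the separator lines and fill those bands in a mask (remember to flip with h - y when drawing in OpenCV, since the .box origin is bottom-left).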

2 Answers


Google's tesseract-ocr already has this functionality in its page segmentation mode (psm). You just need a better Python wrapper, one that exposes more of Tesseract's functionality than pytesseract does. One of the better ones is tesserocr.

A simple example with your image:

  import cv2
  import numpy as np
  import tesserocr as tr
  from PIL import Image

  cv_img = cv2.imread('text.png', cv2.IMREAD_UNCHANGED)

  # since tesserocr accepts PIL images, convert the OpenCV image to PIL
  pil_img = Image.fromarray(cv2.cvtColor(cv_img, cv2.COLOR_BGR2RGB))

  # initialize api
  api = tr.PyTessBaseAPI()
  try:
    # set pil image for ocr
    api.SetImage(pil_img)
    # Google tesseract-ocr has a page segmentation mode (psm) option for specifying ocr types
    # psm values can be: block of text, single text line, single word, single character etc.
    # the api.GetComponentImages method exposes this functionality
    # it returns a list of tuples of:
    # image (:class:`PIL.Image`): Image object.
    # bounding box (dict): dict with x, y, w, h keys.
    # block id (int): textline block id (if blockids is ``True``). ``None`` otherwise.
    # paragraph id (int): textline paragraph id within its block (if paraids is True).
    # ``None`` otherwise.
    boxes = api.GetComponentImages(tr.RIL.TEXTLINE, True)
    # get text
    text = api.GetUTF8Text()
    # iterate over the returned list and draw the rectangles
    for (im, box, _, _) in boxes:
      x, y, w, h = box['x'], box['y'], box['w'], box['h']
      cv2.rectangle(cv_img, (x, y), (x + w, y + h), color=(0, 0, 255))
  finally:
    api.End()

  cv2.imshow('output', cv_img)
  cv2.waitKey(0)
  cv2.destroyAllWindows()

result

zindarod
  • 3
    That looks even better than what I was proposing :) – api55 Nov 13 '17 at 14:50
  • Yep, but it gives me an error while trying to install tesserocr. Actually I am also using pyocr as a wrapper. Can't it be done with that one? Thanks. Seems perfect for what I want to do right now.. Error: `python setup.py egg_info" failed with error code 1`.. Searching for a solution.. – lucians Nov 13 '17 at 14:59
  • @api55 opencv already has a *text* module which builds on google's tesseract-ocr, unfortunately the python API does not expose the *psm* functionality. – zindarod Nov 13 '17 at 15:02
  • @Link *Pytesseract* just converts your function arguments to command line arguments for tesseract. I couldn't figure out the *psm* option through the command line, but maybe you can. – zindarod Nov 13 '17 at 15:08
  • @zindarod trying to do the job with [this image](https://i.imgur.com/nbHUWK6.png) doesn't work in detecting blocks of text.. Edit: the text blocks are too close together. If I separate them the script works.. Is there a way to do the job with the image as is? – lucians Nov 14 '17 at 09:19

I am late here, searching for something else. I have never used the Tesseract wrappers; they just seem to get in the way for no real benefit. All they are doing is abstracting away the call to the subprocess.

This is how I access the psm configuration through the args passed to a subprocess. I have included the oem, pdf and hocr parameters as well for completeness, but they are not necessary; you can pass just the psm parameter. Do run the help commands at the terminal, as there are 13 psm options and 4 oem options. Depending on what you are doing, the quality can depend heavily on the psm.

It is possible to pipe in and out using subprocess.Popen(), or if you are feeling adventurous you can do it asynchronously with asyncio.create_subprocess_exec() in much the same way.

import subprocess

# args
# 'tesseract' - the executable name
# path to the image file
# output file name - no extension, tesseract will add .txt, .pdf, .hocr etc.
# optional params
# --psm x to set the page segmentation mode; see more with tesseract --help-psm at the cli
# --oem x to set the ocr engine mode; see more with tesseract --help-oem
# a mode parameter can be added to the end of the args list to get output as:
# searchable pdf - just add the parameter 'pdf' as below
# hOCR output (html) - just add 'hocr' as below

# note: each flag and its value must be separate list elements; passing
# '-psm 1' as one element makes tesseract see it as a single unknown argument
args = ['tesseract', 'Im1.tiff', 'Im1', '--psm', '1', '--oem', '2']

# args = ['tesseract', 'Im1.tiff', 'Im1', '--psm', '1', '--oem', '2', 'pdf']
# args = ['tesseract', 'Im1.tiff', 'Im1', '--psm', '1', '--oem', '2', 'hocr']

try:
    proc = subprocess.check_call(args)
    print('subprocess retcode {r}'.format(r=proc))
except subprocess.CalledProcessError as exp:
    print('subprocess.CalledProcessError : ', exp)
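The piping variant mentioned above follows the same pattern. In this sketch plain Python stands in for tesseract so the plumbing can be shown without an image file; with a real invocation you would substitute the tesseract args (check your tesseract version for stdin/stdout filename support, e.g. '-' and 'stdout').

```python
import subprocess
import sys

# Pipe bytes through a subprocess and capture its output. The same
# Popen/communicate pattern works with a tesseract command line that
# reads from stdin and writes to stdout (hypothetical stand-in below).
proc = subprocess.Popen(
    [sys.executable, '-c',
     'import sys; sys.stdout.write(sys.stdin.read().upper())'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate(b'line1 line2')
print(out)  # b'LINE1 LINE2'
```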
Chanonry