
Image to read

I am using Tesseract to convert an image into text on a CentOS instance, but I am not able to handle blank cells.

Output that I get from tesseract:

Legal entity Project Category Mon 08/ 20 Tue 08/ 21 Wed 08/ 22 Thu 08/23 Fri 08/24 Sat 08/25 Sun 08/ 26 Total

test Development Improvements - Improvem 8.00 8.00 8.00 8.00 8.00 40.00 H '9

Please note that in the second line there is only a single space after the last 8.00 and before 40.00, even though the Sat and Sun cells are empty.

Vipin Choudhary

3 Answers


I would try to locate the regions that can contain text (the table cells) before running OCR, and treat each one as a region of interest (ROI). Then run the OCR on the ROIs instead of on the whole image. For each ROI, check whether it contains any contours: if it does, perform OCR on it, otherwise emit a blank. Hope it helps a bit, cheers!

Example:

import cv2

img = cv2.imread('table_so.png')

# Shrink the image slightly and draw a black frame around it so that the
# outer cells are closed off from the image edge.
res = cv2.resize(img, None, fx=0.8, fy=0.8, interpolation=cv2.INTER_CUBIC)
img_h, img_w = res.shape[:2]
cv2.rectangle(res, (0, 0), (img_w, img_h), (0, 0, 0), 10)

gray = cv2.cvtColor(res, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 220, 255, cv2.THRESH_BINARY)
# findContours returns 3 values in OpenCV 3.x and 2 values in 4.x;
# taking the last two works with both.
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)[-2:]
# Sort roughly top to bottom, then left to right.
sort_cnts = sorted(contours, key=lambda ctr: cv2.boundingRect(ctr)[0] + cv2.boundingRect(ctr)[1] * img_w)

ROIs = []

for cnt in sort_cnts:
    x, y, w, h = cv2.boundingRect(cnt)
    # Keep only boxes that are roughly cell-sized.
    if 2000 > w > 70 and h < 100:
        ROI = res[y:y+h, x:x+w]
        ROIs.append(ROI)
        cv2.rectangle(res, (x, y), (x+w, y+h), (0, 255, 0), 2)

for roi in ROIs:
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 220, 255, cv2.THRESH_BINARY)
    contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)[-2:]
    # An empty cell yields a single contour (the cell itself);
    # any text inside adds more.
    if len(contours) > 1:
        print('DO OCR HERE')
    else:
        print('BLANK SPACE')
    cv2.imshow('img', gray)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

# Show the full image with the ROIs outlined.
cv2.imshow('img', res)
cv2.waitKey(0)
cv2.destroyAllWindows()

Result:

enter image description here

(The green boxes are ROIs)

  • DO OCR HERE
  • DO OCR HERE
  • DO OCR HERE
  • DO OCR HERE
  • DO OCR HERE
  • DO OCR HERE
  • DO OCR HERE
  • DO OCR HERE
  • DO OCR HERE
  • BLANK SPACE
  • BLANK SPACE
  • BLANK SPACE
kavko
  • Is this possible to do with JavaScript and Laravel? – Nuwan Withanage Mar 12 '20 at 05:05
  • @Nuwan Withanage I don't have much experience with web technologies, but after googling I think you could achieve this by using OpenCV.js in your project. Here is the link to the official OpenCV.js tutorial: https://docs.opencv.org/master/d5/d10/tutorial_js_root.html – kavko Mar 12 '20 at 05:17
  • I did this with Python and you saved my day. Now I need to print the address of each empty cell, such as 'first row - 9th column - Empty, first row - 10th column - Empty' for your sample table, using the ROIs and without calling the OCR function. – Nuwan Withanage Mar 24 '20 at 06:45
  • Is it also possible to print a row-wise empty cell count, such as '1st row - 3, 2nd row - 1', using the ROIs? Thanks a lot. – Nuwan Withanage Mar 24 '20 at 08:19
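
For the follow-up in the comments (reporting which cells are empty and counting them per row), one possible approach is to keep each ROI's bounding box together with its blank/non-blank flag, group the boxes into rows by their y coordinate, and number them left to right. This is only a sketch: the function names, the `(x, y, w, h, is_blank)` tuple layout, and the `row_tolerance` value are assumptions for illustration and are not part of the answer's code.

```python
def group_cells_into_rows(cells, row_tolerance=10):
    """Group (x, y, w, h, is_blank) tuples into rows.

    Cells whose top edge is within row_tolerance pixels of the first
    cell in the current row are treated as belonging to that row.
    """
    rows = []
    for cell in sorted(cells, key=lambda c: c[1]):
        if rows and abs(cell[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    # Order every row left to right.
    return [sorted(row, key=lambda c: c[0]) for row in rows]


def report_blank_cells(cells, row_tolerance=10):
    """Return ([(row, column), ...] of blank cells, {row: blank_count})."""
    addresses = []
    counts = {}
    rows = group_cells_into_rows(cells, row_tolerance)
    for r, row in enumerate(rows, start=1):
        counts[r] = 0
        for c, cell in enumerate(row, start=1):
            if cell[4]:
                addresses.append((r, c))
                counts[r] += 1
    return addresses, counts


# Made-up boxes for a 2 x 3 table; True marks a blank cell.
cells = [
    (0, 0, 100, 30, False), (100, 1, 100, 30, False), (200, 0, 100, 30, True),
    (0, 40, 100, 30, True), (100, 41, 100, 30, False), (200, 40, 100, 30, True),
]
addresses, counts = report_blank_cells(cells)
print(addresses)  # [(1, 3), (2, 1), (2, 3)]
print(counts)     # {1: 1, 2: 2}
```

In the answer's loop, the tuple for each cell could be built from `cv2.boundingRect(cnt)` plus the result of the `len(contours) > 1` check, so no OCR call is needed to produce the report.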

You can either train Tesseract to recognize blank spaces (not recommended, since it can mess up the otherwise 100% correct output you're getting), or resolve the issue in code. Unfortunately, there's no simple way to train Tesseract to behave exactly the way you want here.

The best solution I see is to display a 0 or something similar (any character you feel comfortable with) in the Saturday and Sunday cells, so that Tesseract can see them and you can react to that.
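
If you cannot change the image itself, a variant of the same idea is to apply the placeholder after OCR, assuming you already have one text result per cell (for example from a per-cell approach) with an empty string for blank cells. The sketch below is that swapped-in variant, not the image-editing approach described above; `fill_blank_cells` is a hypothetical helper name.

```python
def fill_blank_cells(cell_texts, placeholder="0"):
    """Replace empty or whitespace-only cell texts with a placeholder."""
    return [t.strip() if t and t.strip() else placeholder for t in cell_texts]


# The data row from the question, with the Sat and Sun cells empty.
row = ["test", "Development", "Improvements - Improvem",
       "8.00", "8.00", "8.00", "8.00", "8.00", "", "", "40.00"]
print(fill_blank_cells(row))
```

Every column then keeps its position in the row, so the Sat and Sun values can be read by index.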

Enashgrow

Try setting preserve_interword_spaces to 1.

How to preserve document structure in tesseract

Tesseract - ambiguity in space and tab
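
A minimal sketch of how that setting might be passed from Python, assuming the `pytesseract` wrapper and Pillow are installed. The `--psm 6` page segmentation mode, the helper names, and the three-space gap threshold are assumptions for illustration, not something this answer prescribes. With the setting on, runs of spaces in the output should be kept rather than collapsed, so a wide gap can hint at one or more empty cells.

```python
import re

# Assumed config: keep inter-word spacing, treat the page as one text block.
TESS_CONFIG = "--psm 6 -c preserve_interword_spaces=1"


def ocr_preserving_spaces(image_path, config=TESS_CONFIG):
    """Run Tesseract on an image file with inter-word spaces preserved."""
    # Imported here so the gap helper below works without OCR installed.
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path), config=config)


def find_wide_gaps(line, min_gap=3):
    """Return (start_index, length) for each run of min_gap or more spaces."""
    pattern = re.compile(r" {%d,}" % min_gap)
    return [(m.start(), len(m.group())) for m in pattern.finditer(line)]


# Usage (needs Tesseract, pytesseract and Pillow):
# text = ocr_preserving_spaces("table_so.png")
# for line in text.splitlines():
#     print(find_wide_gaps(line), repr(line))
```

A wide gap only shows that something is missing at that position; it does not say how many cells are empty, so the column offsets from the header line would still be needed to map gaps back to columns.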

nguyenq