tesseract for empty table cell

Question

I am using Tesseract to convert an image into text on a CentOS instance, but I am not able to handle blank table cells.

Output that I get from tesseract:

Legal entity Project Category Mon 08/ 20 Tue 08/ 21 Wed 08/ 22 Thu 08/23 Fri 08/24 Sat 08/25 Sun 08/ 26 Total

test Development Improvements - Improvem 8.00 8.00 8.00 8.00 8.00 40.00 H '9

Please note that in the second line there is only a space between the last 8.00 and 40.00, because the Sat and Sun cells are empty.

kavko · Answer 1 · 2018-09-23T19:18:01.427

I would first locate the regions that contain the table cells and treat each one as a region of interest (ROI). Then run OCR on the individual ROIs instead of on the whole image. For each ROI, check whether it contains any contours: if it does, perform OCR on it, otherwise emit a blank space. Hope it helps a bit, cheers!

Example:

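import cv2

img = cv2.imread('table_so.png')

# Scale the image down and draw a black frame along its border
res = cv2.resize(img, None, fx=0.8, fy=0.8, interpolation=cv2.INTER_CUBIC)
img_h, img_w, _ = res.shape
cv2.rectangle(res, (0, 0), (img_w, img_h), (0, 0, 0), 10)

# Binarize: pixels brighter than 220 become white, everything else black
gray = cv2.cvtColor(res, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 220, 255, cv2.THRESH_BINARY)
# [-2:] works whether findContours returns 3 values (OpenCV 3) or 2 (OpenCV 4)
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)[-2:]
# Order contours roughly top-to-bottom, then left-to-right
sort_cnts = sorted(contours, key=lambda ctr: cv2.boundingRect(ctr)[0] + cv2.boundingRect(ctr)[1] * img_w)

ROIs = []

# Keep only contours whose bounding box is about the size of a table cell
for cnt in sort_cnts:
    x, y, w, h = cv2.boundingRect(cnt)
    if 2000 > w > 70 and h < 100:
        ROI = res[y:y+h, x:x+w]
        ROIs.append(ROI)
        cv2.rectangle(res, (x, y), (x+w, y+h), (0, 255, 0), 2)

# More than one contour inside a ROI is taken to mean the cell is not empty
for roi in ROIs:
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 220, 255, cv2.THRESH_BINARY)
    contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)[-2:]
    if len(contours) > 1:
        print('DO OCR HERE')
    else:
        print('BLANK SPACE')
    cv2.imshow('img', gray)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

# Show the full image with the detected ROIs outlined in green
cv2.imshow('img', res)
cv2.waitKey(0)
cv2.destroyAllWindows()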

Result:

(The green boxes are ROIs)

DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
DO OCR HERE
BLANK SPACE
BLANK SPACE
BLANK SPACE
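
To turn the two print placeholders into actual output, each non-blank ROI could be passed to Tesseract and each blank one recorded as an empty string. This is only a minimal sketch: it assumes the pytesseract wrapper is installed (the answer itself does not use it), `tesseract_cell`, `cells_to_text`, `has_content` and `ocr` are hypothetical names, and `--psm 7` (treat the image as a single text line) is a guess at what suits a single cell.

```python
def tesseract_cell(roi):
    """OCR one cell image with pytesseract (assumed installed, imported lazily)."""
    import pytesseract
    # --psm 7: treat the image as a single line of text
    return pytesseract.image_to_string(roi, config='--psm 7').strip()


def cells_to_text(rois, has_content, ocr=tesseract_cell):
    """Return one string per ROI; ROIs flagged as blank become ''."""
    return [ocr(roi) if has_content(roi) else '' for roi in rois]
```

Here `has_content` would be a function wrapping the `len(contours) > 1` check from the loop above, so the empty Sat and Sun cells would show up as empty strings in the result instead of disappearing.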

@Nuwan Withanage I don't have much experience with web technologies, but after googling I think you could achieve this by using OpenCV.js in your project. Here is the link to the official OpenCV.js tutorial: https://docs.opencv.org/master/d5/d10/tutorial_js_root.html — kavko, Mar 12 '20 at 05:17
I did this with Python and you saved my day. Now I need to print the address of each empty cell, such as 'first row - 9th column - Empty, first row - 10th column - Empty' for your sample table, using the ROIs and without using the OCR function. — Nuwan Withanage, Mar 24 '20 at 06:45
Is it also possible to print the empty cell count per row, such as '1st row - 3, 2nd row - 1', by using the ROIs? Thanks a lot. — Nuwan Withanage, Mar 24 '20 at 08:19
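
One possible way to get the cell addresses asked about in these comments is to group the ROI bounding boxes into rows by their y coordinate and number them left to right. The sketch below is plain Python and makes several assumptions: `empty_cell_report`, `boxes`, `blank_flags` and `row_tol` are hypothetical names, the boxes are the `(x, y, w, h)` tuples from `cv2.boundingRect`, and cells of one row differ in `y` by at most `row_tol` pixels.

```python
def empty_cell_report(boxes, blank_flags, row_tol=10):
    """Locate blank cells by (row, column), both 1-based.

    boxes: list of (x, y, w, h) tuples, one per ROI.
    blank_flags: parallel list of bools, True where the ROI is blank.
    Returns (addresses, counts): a list of (row, col) pairs for blank cells
    and a dict mapping row number to the number of blank cells in that row.
    """
    cells = sorted(zip(boxes, blank_flags), key=lambda c: c[0][1])  # by y
    rows = []
    for box, blank in cells:
        # Same row if y is within row_tol of the first cell of the current row
        if rows and abs(box[1] - rows[-1][0][0][1]) <= row_tol:
            rows[-1].append((box, blank))
        else:
            rows.append([(box, blank)])

    addresses, counts = [], {}
    for r, row in enumerate(rows, start=1):
        row.sort(key=lambda c: c[0][0])  # left to right by x
        for c, (_, blank) in enumerate(row, start=1):
            if blank:
                addresses.append((r, c))
                counts[r] = counts.get(r, 0) + 1
    return addresses, counts
```

The column number is only the position among the ROIs detected in that row, so any cell that the size filter (`2000 > w > 70 and h < 100`) skips would shift the numbering.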

score 0 · Answer 2 · answered Sep 20 '18 at 11:39

You can either train Tesseract to recognize blank spaces (not recommended, since it can mess up the otherwise 100% correct output you're getting), or resolve the issue in code. Unfortunately, there's no simple way to train Tesseract to behave the way you want here.

The best solution I see is to display a 0 or something similar (any character you feel comfortable with) in the Saturday and Sunday cells, so that Tesseract can see them and you can react to that.

score 0 · Answer 3 · answered Sep 20 '18 at 23:41

Try setting preserve_interword_spaces to 1.

How to preserve document structure in tesseract

Tesseract - ambiguity in space and tab
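
As a rough illustration of this suggestion, the option can be passed through Tesseract's `-c` flag. The sketch assumes the pytesseract and Pillow packages are installed (neither is mentioned in the question); `ocr_preserving_spaces` and `split_on_gaps` are hypothetical helper names, and `--psm 6` (a single uniform block of text) is an assumption about the page layout.

```python
import re


def ocr_preserving_spaces(image_path):
    """Run Tesseract with preserve_interword_spaces=1 (pytesseract assumed installed)."""
    import pytesseract
    from PIL import Image
    config = '--psm 6 -c preserve_interword_spaces=1'
    return pytesseract.image_to_string(Image.open(image_path), config=config)


def split_on_gaps(line, min_spaces=2):
    """Split an OCR line wherever at least min_spaces consecutive spaces occur."""
    return re.split(r' {%d,}' % min_spaces, line.strip())
```

With the option on, a wide gap in a line hints at skipped cells, but the width of the gap is not guaranteed to map to an exact number of empty cells, so this is at best a heuristic.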

nguyenq
