
I am trying to extract data from a scanned form. The form has a standard format similar to the one shown in the image below:

enter image description here

I have tried using pytesseract (Tesseract OCR) to detect the image's text, and it does a decent job of finding the text and converting the image to text. However, it essentially just gives me all the detected text without preserving the layout of the data.

I would like to be able to do something like the below:

Find a particular piece of text and then find the associated data below or beside it, similar to this question that uses OpenCV: "Detect text region in image using Opencv".

enter image description here

Is there a way that I can essentially do the following:

  1. Either find all text boxes on the form, perform OCR on each box and see which one is the closest match to the "witnesses:" text, then find the sections immediately below it and perform separate OCR on those.
  2. Or, if the form is standard and I know the approximate location of the "witness" text section, can I specify its general location in OpenCV and then just extract the text below it and perform OCR on that?

EDIT: I have tried the code below to detect specific regions of the text. However, it does not single out the text I am looking for; it just marks all regions.

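import cv2

img = cv2.imread('t2.jpg')
mser = cv2.MSER_create()

# Upscale 2x so small print is easier to detect
img = cv2.resize(img, (img.shape[1] * 2, img.shape[0] * 2))
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
vis = img.copy()

# detectRegions returns (point sets, bounding boxes); draw the hull of each point set
regions = mser.detectRegions(gray)
hulls = [cv2.convexHull(p.reshape(-1, 1, 2)) for p in regions[0]]
cv2.polylines(vis, hulls, True, (0, 255, 0))

cv2.imshow('img', vis)
cv2.waitKey(0)  # keep the window open until a key is pressed
cv2.destroyAllWindows()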

Here is the result:

enter image description here

Mustard Tiger
  • Tesseract can give you bounding boxes, are you using a wrapper? – juanpa.arrivillaga Aug 15 '17 at 03:09
  • As you seem to have the form in a well defined format, you may manually define some bounding boxes, crop the image and run tesseract on those cropped images individually. – ZdaR Aug 15 '17 at 06:48
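
Following up on the bounding-box suggestion above, here is a minimal sketch of the first option: let Tesseract report word boxes, look for the label, then OCR only the strip below it. It assumes a pytesseract version that provides `image_to_data`; the helper names (`find_label_box`, `region_below`, `ocr_below_label`), the `rows` and `pad` values, and the "witness" search string are illustrative assumptions, not anything from the original post.

```python
def find_label_box(data, label):
    """Return (left, top, width, height) of the first OCR word containing
    `label` (case-insensitive), or None if no word matches.

    `data` is the dict produced by pytesseract.image_to_data(...,
    output_type=pytesseract.Output.DICT).
    """
    needle = label.lower()
    for i, word in enumerate(data["text"]):
        if needle in word.strip().lower():
            return (data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i])
    return None


def region_below(box, img_shape, rows=3, pad=5):
    """Rectangle (x0, y0, x1, y1) covering roughly `rows` text lines below
    `box`, clipped to the image. `rows` and `pad` are guesses to tune."""
    left, top, width, height = box
    img_h, img_w = img_shape[:2]
    x0 = max(left - pad, 0)
    y0 = min(top + height + pad, img_h)
    x1 = img_w
    y1 = min(y0 + rows * (height + pad), img_h)
    return x0, y0, x1, y1


def ocr_below_label(path, label="witness"):
    """OCR the area under the first word matching `label`; None if not found."""
    # Imported here so the geometry helpers work without OpenCV/Tesseract.
    import cv2
    import pytesseract
    from PIL import Image

    img = cv2.imread(path)
    if img is None:
        raise FileNotFoundError(path)
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    data = pytesseract.image_to_data(
        Image.fromarray(rgb), output_type=pytesseract.Output.DICT
    )
    box = find_label_box(data, label)
    if box is None:
        return None
    x0, y0, x1, y1 = region_below(box, img.shape)
    crop = Image.fromarray(rgb[y0:y1, x0:x1])
    # "--psm 6" asks Tesseract to treat the crop as one uniform block of text.
    return pytesseract.image_to_string(crop, config="--psm 6")
```

Something like `ocr_below_label('t2.jpg')` would then return only the text under the label. Matching on a substring such as "witness" is a deliberate choice so that small OCR differences in the label (a trailing colon, plural form) do not prevent a match.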

1 Answer


I think you already have the answer in your own post. I recently did something similar, and this is how I did it:

# id_image was loaded with cv2.imread
temp_image = id_image[start_y:end_y, start_x:end_x]
img = Image.fromarray(temp_image)
# "--psm 7" treats the crop as a single line of text (Tesseract 3.x spells the flag "-psm 7")
text = pytesseract.image_to_string(img, config="--psm 7")

So basically, if your format is predefined, you just need to know the location of the fields whose text you want (which you already know), crop each one, and then run the OCR (Tesseract) extraction on the crop.

For this you need to import pytesseract, PIL (for Image), cv2 and numpy.
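
If there are several fields, the same crop-then-OCR idea can be driven from a small table of coordinates. The sketch below is only an illustration: the field names and pixel values in `FIELDS` are made up and would have to be measured on your own form, and `clamp_box` and `read_fields` are hypothetical helper names.

```python
# Hypothetical layout: field name -> (start_x, start_y, end_x, end_y) in pixels.
# These numbers are placeholders, not measurements from the form in the question.
FIELDS = {
    "witness_1": (50, 300, 450, 340),
    "witness_2": (50, 340, 450, 380),
}


def clamp_box(box, width, height):
    """Clip a box to the image bounds so slicing never silently goes empty."""
    start_x, start_y, end_x, end_y = box
    start_x, end_x = max(0, start_x), min(width, end_x)
    start_y, end_y = max(0, start_y), min(height, end_y)
    if start_x >= end_x or start_y >= end_y:
        raise ValueError("box %r lies outside a %dx%d image" % (box, width, height))
    return start_x, start_y, end_x, end_y


def read_fields(path, fields=FIELDS):
    """Crop every configured field and OCR it as a single line of text."""
    # Imported here so clamp_box can be used without OpenCV/Tesseract installed.
    import cv2
    import pytesseract
    from PIL import Image

    id_image = cv2.imread(path)
    if id_image is None:
        raise FileNotFoundError(path)
    gray = cv2.cvtColor(id_image, cv2.COLOR_BGR2GRAY)
    height, width = gray.shape[:2]

    results = {}
    for name, box in fields.items():
        start_x, start_y, end_x, end_y = clamp_box(box, width, height)
        crop = Image.fromarray(gray[start_y:end_y, start_x:end_x])
        results[name] = pytesseract.image_to_string(crop, config="--psm 7").strip()
    return results
```

Calling `read_fields('t2.jpg')` would return a dict mapping each field name to its recognised text. Fixed pixel coordinates only hold if every scan has the same resolution and alignment, so resizing or deskewing the scans to a common size first is probably needed in practice.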

roccolocko