I am working on a contract sheet with OpenCV and pytesseract, and I want to extract the words from this image:

[Image: report-image]

I am trying with getStructuringElement, but my code jumps to the next line in the center of the image. I want to extract the words from the left side of the image first and, once every string on the left has been extracted, move to the right side of the image.

The code is:

import cv2
import pytesseract
from PIL import Image

image = cv2.imread("report_name-1.jpg")

# preprocessing
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # grayscale

# cv2.threshold returns (retval, image); keep only the image
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)  # threshold

kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))

dilated = cv2.dilate(thresh, kernel, iterations=13)  # dilate

contours, hierarchy = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)  # get contours

for contour in contours:
    # get rectangle bounding contour
    x, y, w, h = cv2.boundingRect(contour)

    # discard areas that are too large
    if h > 300 and w > 300:
        continue

    # discard areas that are too small
    if h < 40 or w < 40:
        continue

    # draw rectangle around contour on original image
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 255), 2)
xpertdev
  • Which version of OpenCV? The number of return values and the syntax of findContours depend on the version. Perhaps you are using the wrong syntax? What does your output for "image" look like after drawing the rectangles? Does it look OK? It is always a good idea to view your intermediate results when developing code to verify that each step is processing as you would expect. – fmw42 Oct 23 '19 at 20:02
  • I am using OpenCV 4.1.1. My apologies, I have uploaded the box image now. Please check it below in the answer section. You can see the boxes are separated from each other along the horizontal axis. – xpertdev Oct 23 '19 at 20:24
  • Are you trying to extract text from left-to-right and top to bottom? – nathancy Oct 23 '19 at 20:50
  • @nathancy Yes, I am trying to extract text from left to right and top to bottom. Even the small boxes of the months (Jan/C, Feb/C, March/C, etc.) are separated from each other. You can check it in the box image (attachment). – xpertdev Oct 23 '19 at 20:59
  • @nathancy There are 4 portions in the image: Lexus Financial Service, BMW Financial Service, AES/Suntrust Bank and Barclays Bank Delaware. I am trying to extract text from left to right only for Lexus Financial Service first, then for BMW Financial, and so on. – xpertdev Oct 23 '19 at 21:04
  • @xpertdev You can filter the output data. The headers for each service were detected, so you can just extract the section that you need. – nathancy Oct 23 '19 at 21:09
  • @nathancy Filtering the output data with regex, am I right? Or does OpenCV have any modules for filtering the output? – xpertdev Oct 24 '19 at 14:19
  • @xpertdev the output is a string so you can use regex or any other type of filtering. OpenCV does not have modules for filtering string output – nathancy Oct 24 '19 at 19:56
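The string filtering discussed in these comments could look roughly like the following. This is a minimal sketch, not the answerer's code: it assumes every account header in the OCR string is a line that starts with `>` and ends its name with ` /`, as in the sample output in the answer. `split_sections` and `HEADER` are hypothetical names, and lines that appear before the first header are simply dropped.

```python
import re

# Assumption: each account section starts with a header line such as
# "> BMW FINANCIAL SERVICES /". Real OCR output may not be this regular.
HEADER = re.compile(r"^>\s*(?P<name>.+?)\s+/")

def split_sections(ocr_text):
    """Group OCR lines under the most recent header line (hypothetical helper)."""
    sections = {}
    current = None
    for line in ocr_text.splitlines():
        match = HEADER.match(line)
        if match:
            # Start a new section named after the header text
            current = match.group("name")
            sections[current] = []
        elif current is not None:
            # Lines before the first header are ignored
            sections[current].append(line)
    return sections

sample = """> BMW FINANCIAL SERVICES /
2602980
Account Type: Auto Lease Account Term: 036
> LEXUS FINANCIAL SERVIC /
1624210
Account Type: Auto Loan Account Term: 072"""

sections = split_sections(sample)
print(sections["LEXUS FINANCIAL SERVIC"])
```

Because OCR can truncate or misread a header (note "SERVIC" in the sample), looking a section up by a prefix or a fuzzy match may be more robust than an exact key.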

1 Answer

You can extract the text left-to-right and top-to-bottom by passing --psm 6, which tells Tesseract to assume a single uniform block of text. Preprocessing also matters: we threshold to obtain a binary image with the desired foreground text in black and the background in white. Other Pytesseract configuration options (page segmentation modes) are available as well. After thresholding, here's the image we throw into Pytesseract:

[Image: the thresholded binary image]

Here's the output

Limit Balance
Sep 29, 2015 $17,750.0 Oct 01, 2018 $0.00 Oct 02, 2018
0
Account Condition: Paid account/zero Account #: Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance 4636676005495602 Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Credit Card Account Term: REV
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2016 0 0 0
2017 0 0 0 0 0 0 0 0 0 0 0 0
2018 0 0 0 0 0 0 0 0 0 B
> BMW FINANCIAL SERVICES /
2602980
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Sep 19, 2015 $27,189.00 Jul01, 2017 $0.00 Jul 21, 2017 Jul 24, 2017
Account Condition: Paid account/zero Account #: 4002206279 Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Auto Lease Account Term: 036
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2015 Cc Cc Cc Cc
2016 Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc
2017 Cc Cc Cc Cc Cc Cc B
> LEXUS FINANCIAL SERVIC /
1624210
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Mar 07, 2015 $40,342.00 Jul01, 2016 $0.00 Jul 05, 2016 Jul 31, 2016
Account Condition: Paid account/zero Account #: Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance 70403662535410001 Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Auto Loan Account Term: 072
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014
2015 Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc
2016 Cc Cc Cc Cc Cc Cc B
> AES/SUNTRUST BANK / 9997195
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Sep 19, 2008 $12,500.00 Apr 01, 2016 $0.00 Apr 21, 2016 Apr 30, 2016
Account Condition: Paid account/zero Account #: Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance 5046237209PA00001 Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Signer
standing
Account Type: Education Loan Account Term: 300
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 Cc Cc Cc Cc Cc Cc Cc Cc Cc
2015 Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc
2016 Cc Cc Cc B
> BARCLAYS BANK DELAWARE /
1223850
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Apr 04, 2013 $3,500.00 Apr 01, 2016 $0.00 Oct 06, 2014 Apr 05, 2016
Account Condition: Paid account/zero Account #: 000176863399109 Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Credit Card Account Term: REV
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 Cc Cc Cc Cc Cc Cc Cc Cc 0
2015 0 0 0 0 0 0 0 0 0 0 0 0
2016 0 0 0 B
> AMERICAN HONDA FINANCE /
1605190
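Here's the code:

import cv2
import pytesseract

# Path to the Tesseract executable (Windows); adjust for your install
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image, convert it to grayscale, and apply Otsu's threshold
image = cv2.imread('1.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# --psm 6: assume a single uniform block of text
data = pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')

print(data)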
nathancy
  • Thanks a lot for your guidance on page segmentation. Appreciated. – xpertdev Oct 24 '19 at 14:05
  • Appreciated. But if you check the first portion of the text, below "Account Type: Credit Card Account Term: REV", in the 2016 line of the month section, 0 0 0 is shown under Jan Feb Mar in the output. These 0 0 0 values actually belong to Oct Nov Dec. The same issue occurs in the month sections of the portions below. How can I control these values so that they show under their own columns, like Oct/0 Nov/0 Dec/0? Please check the image portion above. – xpertdev Oct 24 '19 at 14:18
  • You will have to do filtering on the string output and organize it. Pytesseract will only attempt to read text and order it based on the configuration value you give it. It is only an approximation so you will have to control the output values yourself – nathancy Oct 24 '19 at 20:38
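One possible way to keep the month values under their own columns is to use word positions rather than the plain string. The sketch below is an assumption, not something proposed in the thread: it relies on `pytesseract.image_to_data` with `output_type=pytesseract.Output.DICT`, which returns per-word fields such as `text`, `left`, `width`, `block_num`, `par_num` and `line_num`. `lines_from_data` and `assign_to_columns` are hypothetical helper names, and the dictionary below is synthetic data standing in for real OCR output.

```python
from collections import defaultdict

def lines_from_data(data):
    """Group non-empty words of an image_to_data dict into lines of (text, x_center)."""
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if text.strip():
            key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
            x_center = data["left"][i] + data["width"][i] / 2
            lines[key].append((text, x_center))
    return list(lines.values())

def assign_to_columns(headers, values):
    """Map each value to the header whose x-center is nearest (hypothetical helper)."""
    result = {}
    for text, x in values:
        nearest = min(headers, key=lambda header: abs(header[1] - x))
        result[nearest[0]] = text
    return result

# With a real image, the dict would come from something like:
#   data = pytesseract.image_to_data(thresh, config='--psm 6',
#                                    output_type=pytesseract.Output.DICT)
# Synthetic stand-in: a header line (Oct Nov Dec) and a value line (0 0 0).
data = {
    "text":      ["", "Oct", "Nov", "Dec", "", "0", "0", "0"],
    "block_num": [1, 1, 1, 1, 1, 1, 1, 1],
    "par_num":   [1, 1, 1, 1, 1, 1, 1, 1],
    "line_num":  [0, 1, 1, 1, 0, 2, 2, 2],
    "left":      [0, 540, 590, 640, 0, 547, 593, 646],
    "width":     [0, 20, 20, 20, 0, 10, 10, 10],
}

header_line, value_line = lines_from_data(data)
print(assign_to_columns(header_line, value_line))
# {'Oct': '0', 'Nov': '0', 'Dec': '0'}
```

On the real report, the leading label of each row (the `#` in the header line and the year in each value line) would have to be dropped before matching, and how well the positions line up depends on how Tesseract segments the table.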