Extract text from an image using OCR in Python

Question

I want to extract text from specific areas of an image, namely the name and ID number on an identity card. It is a Chinese ID card, so the text is in Chinese. I tried the code below, but it only extracts the address and date of birth, which I don't need. I only need the name and ID number.

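import os

import cv2
import pytesseract
from PIL import Image

# Load the card, convert it to grayscale and binarize it with Otsu's threshold.
image = cv2.imread("E:/face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# Write the binary image to a temporary file named after the process ID.
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# OCR the whole image with the Simplified Chinese model, then delete the file.
text = pytesseract.image_to_string(Image.open(filename), lang='chi_sim')
print(text)
os.remove(filename)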

I have also attached the image I am trying to extract text from. I have tried everything I know without success. Any help or guidance would be appreciated.

Show us the error instead. Showing the error would help people here suggest a solution. If you have no idea how to approach the problem, look for other tutorials. – Krishna Jul 11 '18 at 05:01
@DevashishPrasad Yes, I am getting this output from my code: (出生 1991年7月14日住址上濂市宝山区渭`鳙七村鹏号5o3雹) – Tehseen Jul 11 '18 at 05:14
@krishna I am asking for help. My existing code doesn't give me the results I want, so I am asking here. – Tehseen Jul 11 '18 at 05:16
@Tehseen Can you attach the binary image as well? If information is already lost in the binary image itself, the OCR won't recognize the characters. – ZdaR Jul 11 '18 at 05:16
@ZdaR I have attached the binary image. Some data is lost in it. Can you help me improve that binary image? – Tehseen Jul 11 '18 at 05:23
@Tehseen Can you please point out the region of the card where the name and ID number appear? I can't read Chinese, so I can't tell where they are. – Devashish Prasad Jul 11 '18 at 05:24
@Tehseen You can improve the binary image by not thresholding. First convert to grayscale as you did, then apply a Gaussian blur (5,5), then Canny edge detection, and finally dilate and erode your image. That makes the characters more visible. But I don't know why my Tesseract prints '?' even though I am using the chi_sim language. – Devashish Prasad Jul 11 '18 at 05:28
@DevashishPrasad The first line, in the top-left corner of the image, is the name, and the last line, at the bottom of the image, is the ID number. "310109199107141011" is the ID number. – Tehseen Jul 11 '18 at 05:29
@DevashishPrasad OK, I will try these and see whether the result improves. I will let you know. Thanks. – Tehseen Jul 11 '18 at 05:32
@DevashishPrasad Should I apply dilation or erosion? I mean, I have to apply either dilate or erode to the image that results from Canny edge detection, right? Please tell me if I am missing anything. – Tehseen Jul 11 '18 at 09:19
@Tehseen We generally apply dilation first and then erosion, but it is entirely up to you. All that matters is an image with clean edges. Also apply an inverted threshold after edge detection, as it improves Tesseract's results. – Devashish Prasad Jul 12 '18 at 11:53
@DevashishPrasad I have extracted the text, but now I want only the first line of the image, which is the name, and the last line at the bottom, which is the ID number. Can you guide me on how to target a specific area of the image so that only the desired text is extracted? – Tehseen Jul 13 '18 at 02:28

score 7 · Answer 1 · answered Jul 11 '18 at 07:47

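I can suggest a pre-processing step to apply before extracting the text. It estimates the background by dilating and median-blurring the green channel, takes the difference from the original so that only the text strokes remain, and then thresholds the result.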

Code:

import cv2
import numpy as np

image = cv2.imread(r'C:\Users\Jackson\Desktop\face.jpg')
green = image[:, :, 1]

#--- dilation on the green channel, then median blur, to estimate the background ---
dilated_img = cv2.dilate(green, np.ones((7, 7), np.uint8))
bg_img = cv2.medianBlur(dilated_img, 21)

#--- finding absolute difference to preserve edges ---
diff_img = 255 - cv2.absdiff(green, bg_img)

#--- normalizing between 0 to 255 ---
norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)
cv2.imshow('norm_img', cv2.resize(norm_img, (0, 0), fx=0.5, fy=0.5))

#--- Otsu threshold ---
th = cv2.threshold(norm_img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imshow('th', cv2.resize(th, (0, 0), fx=0.5, fy=0.5))

#--- keep the preview windows open until a key is pressed ---
cv2.waitKey(0)
cv2.destroyAllWindows()

Use it and let me know if you are able to find the relevant textual information!

Jeru Luke


I used your code and can now extract the name on the first line, but it still doesn't extract the ID number on the last line of the card. The number is very clear in the image, so I don't know why it is missed. This is the output I am getting from this code: "姓名` 费家杰…翼叠沣瓢男二黾族汉 _ …′^出`…生`〉翼叠g肝勇7 月斓亘住址上诲市宝山区泗塘七村93 '号503室"′ ′′二" – Tehseen Jul 12 '18 at 03:29
I converted the original image to grayscale, applied the dilation to that gray image, and then took the absolute difference; the results are now a bit better. I am now getting the ID number, but not accurately. This is the output: "性别男〈 “ =) 黾族汉… ` _ _′ .…′′z′′ 「出生`′ 「叠g′丐菩荠二]7′_眉菩卒垂′暮′日` 「` 住址上诲市宝山区泗塘七村腋号503菖] ′…`】 …` `_ ′ ′` 毛 ′ 公民身份号码 '′′"31b『D9i991o蓁141011" – Tehseen Jul 12 '18 at 03:50
@Tehseen I think you have to tweak the dilation parameters a bit more, such as the type and size of the kernel. Or try a median blur to remove the unwanted smaller spots (be careful when choosing that kernel size as well). – Jeru Luke Jul 12 '18 at 07:28
I have updated the dilation code to "dilated_img = cv2.dilate(gray, np.ones((5, 5), np.uint8))" and "bg_img = cv2.medianBlur(dilated_img, 23)". It is better now, but there is still something wrong on the first line, and I only want the name (the first line) and the ID number (the last line). This is the output I am getting now: 姓名费家加 __ 「`′' 性名u ′男… ' 民族汉 __ 出生 199壕年~7月童4日住址上海市宝山区泗塘七村93 乙工乙道 ′ 公民身份号码 310109199107141011. Can you guide me on how to target a specific area so that only the name and ID number are extracted? – Tehseen Jul 12 '18 at 08:40
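
One way to target only the name and ID number is to crop fixed regions of the card before running OCR, since the comments describe the name as sitting at the top left and the ID number along the bottom. A minimal sketch, assuming the image is already cropped to the card itself and is a 2-D (grayscale or thresholded) NumPy array; the helper names and the fractional boxes in `REGIONS` are hypothetical placeholders that would have to be measured on your own scans.

```python
# Fractional (left, top, right, bottom) boxes; placeholder guesses, not measured values.
REGIONS = {
    "name": (0.05, 0.05, 0.60, 0.25),
    "id_number": (0.25, 0.78, 0.98, 0.95),
}


def fraction_to_pixels(width, height, box):
    """Convert a fractional (left, top, right, bottom) box to pixel bounds."""
    left, top, right, bottom = box
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


def ocr_region(image, box, lang):
    """Crop one fractional region from a 2-D NumPy image and OCR it as one line."""
    # Imported lazily; pytesseract also needs the Tesseract binary installed.
    import pytesseract
    from PIL import Image

    height, width = image.shape[:2]
    x0, y0, x1, y1 = fraction_to_pixels(width, height, box)
    crop = image[y0:y1, x0:x1]
    # --psm 7 asks Tesseract to treat the crop as a single line of text.
    return pytesseract.image_to_string(
        Image.fromarray(crop), lang=lang, config="--psm 7").strip()
```

For example, `ocr_region(th, REGIONS["name"], "chi_sim")` would OCR only the name area of the thresholded image `th` produced by the answer's code. Fixed boxes only work if every card is framed the same way, so a skewed or loosely cropped photo would need to be aligned first.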

score 0 · Answer 2 · answered Jun 24 '19 at 04:20

In pytesseract, lang='chi_sim' tries to interpret the digits as Chinese characters too. Use lang='eng' to get the numbers OCR'ed properly.
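
Building on this, the ID number can be read with the English model and then picked out of the OCR output with a pattern. The sketch below assumes the 18-character format (17 digits followed by a digit or X) and a 2-D NumPy image already cropped to the ID-number line. The function names are made up for illustration, and `tessedit_char_whitelist` is a Tesseract config variable that the LSTM engine in Tesseract 4.0 reportedly ignores, so treat the whitelist as optional.

```python
import re

# 17 digits followed by a digit or X; the lookarounds reject longer digit runs.
ID_PATTERN = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9Xx])")


def find_id_number(ocr_text):
    """Return the first 18-character ID-like token in OCR output, or None."""
    match = ID_PATTERN.search(ocr_text)
    return match.group(0).upper() if match else None


def read_id_number(image, whitelist="0123456789X"):
    """OCR a cropped ID-number strip with the English model."""
    # Imported lazily; pytesseract also needs the Tesseract binary installed.
    import pytesseract
    from PIL import Image

    # --psm 7: single text line; the whitelist restricts output to ID characters.
    config = "--psm 7 -c tessedit_char_whitelist={}".format(whitelist)
    text = pytesseract.image_to_string(
        Image.fromarray(image), lang="eng", config=config)
    return find_id_number(text)
```

`find_id_number` can also be applied to full-card output from `chi_sim`, which helps when the digits were recognized correctly but are buried in surrounding noise.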

SRK
