7

I want to extract text from specific areas of an image, namely the name and ID number on an identity card. The card is a Chinese ID card, so the text is in Chinese. The code below only extracts the address and date of birth, which I don't need. I just need the name and ID number.

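import os

import cv2
import pytesseract
from PIL import Image

# Load the card and convert it to grayscale
image = cv2.imread("E:/face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Binarize with Otsu's automatically chosen threshold
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# Write the binary image to a temporary file named after the process ID
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# OCR the whole image with the Simplified Chinese model, then clean up
text = pytesseract.image_to_string(Image.open(filename), lang='chi_sim')
print(text)
os.remove(filename)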

I have also attached the image from which I am trying to extract text. I have tried everything I know without success. Any help or guidance would be appreciated.

This is the binary image

Tehseen
  • Are you getting ? as output from tesseract.... – Devashish Prasad Jul 11 '18 at 04:58
  • Show us the error instead. Showing the error would help people here give a solution. If you have no idea how to proceed with the problem, look for other tutorials. – Krishna Jul 11 '18 at 05:01
  • @DevashishPrasad yes i am getting this output from my code (出生 1991年7月14日 住 址 上濂市宝山区渭`鳙七村鹏 号5o3雹) – Tehseen Jul 11 '18 at 05:14
  • @krishna I am asking for help. My existing code doesn't give me the desired results, so I am asking for help here. – Tehseen Jul 11 '18 at 05:16
  • @Tehseen Can you attach the binary image as well? If information is lost in the binary image itself, then it won't recognize the characters. – ZdaR Jul 11 '18 at 05:16
  • @ZdaR I have attached the binary image. Some data is lost in that binary image. Can you help improve it? – Tehseen Jul 11 '18 at 05:23
  • @Tehseen Can you please locate the region where the name and ID number are on the card? I cannot read Chinese, so I am unable to figure out where the ID number and name are. – Devashish Prasad Jul 11 '18 at 05:24
  • @Tehseen You can improve that binary image by not using thresholding. First convert it to grayscale as you did, then apply a Gaussian blur (5,5), then Canny edge detection, and finally dilate and erode your image. It makes the characters more visible. But I don't know why my tesseract prints '?' even though I am using the chi_sim language. – Devashish Prasad Jul 11 '18 at 05:28
  • @DevashishPrasad the first one on the left top corner of image is the name and the last one on the bottom of image is the id number.. "310109199107141011" is the id number – Tehseen Jul 11 '18 at 05:29
  • @DevashishPrasad OK, I will try these and see if they improve the result. I will let you know. Thanks. – Tehseen Jul 11 '18 at 05:32
  • @DevashishPrasad Should I apply dilation or erosion? I mean, I have to apply either dilate or erode to the resulting image after Canny edge detection, right? Kindly guide me if I am missing anything. – Tehseen Jul 11 '18 at 09:19
  • @Tehseen We generally apply dilation first and then erosion, but it is completely up to you. All that matters is an image with clean edges. Also apply an inverted threshold after edge detection, as it improves the performance of tesseract. – Devashish Prasad Jul 12 '18 at 11:53
  • @DevashishPrasad I have extracted the text, but now I want to extract only the first line on the image, which is the name, and the last line at the bottom of the image, which is the ID number. Can you guide me on how to target a specific area of the image to extract only the desired text? – Tehseen Jul 13 '18 at 02:28
  • Wow, did you really post your ID card on the internet? – 钟智强 Jul 30 '23 at 08:19

2 Answers

7

I can suggest a pre-processing step prior to finding textual information. The code is simple to comprehend.

Code:

import cv2
import numpy as np

image = cv2.imread(r'C:\Users\Jackson\Desktop\face.jpg')

#--- dilation on the green channel ---
dilated_img = cv2.dilate(image[:,:,1], np.ones((7, 7), np.uint8))
bg_img = cv2.medianBlur(dilated_img, 21)

#--- finding absolute difference to preserve edges ---
diff_img = 255 - cv2.absdiff(image[:,:,1], bg_img)

#--- normalizing between 0 to 255 ---
norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)
cv2.imshow('norm_img', cv2.resize(norm_img, (0, 0), fx = 0.5, fy = 0.5))

#--- Otsu threshold ---
th = cv2.threshold(norm_img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imshow('th', cv2.resize(th, (0, 0), fx = 0.5, fy = 0.5))

#--- keep the windows open until a key is pressed ---
cv2.waitKey(0)
cv2.destroyAllWindows()

Use it and let me know if you are able to find the relevant textual information!
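
To target only the name and the ID number, as asked in the comments, one option is to crop the pre-processed image to a fixed region before calling Tesseract. The following is a minimal sketch, not a tested solution for this card: `roi_to_pixels`, `ocr_roi`, `read_name_and_id`, and the fractional boxes `NAME_ROI` and `ID_ROI` are hypothetical names and values that would have to be measured on the actual image. It assumes the card fills the frame, so that fractions of the width and height land on the same fields for every photo.

```python
def roi_to_pixels(shape, roi):
    """Convert a fractional (left, top, right, bottom) box to pixel bounds.

    `shape` is image.shape from OpenCV, i.e. (height, width[, channels]).
    """
    height, width = shape[:2]
    left, top, right, bottom = roi
    return (int(left * width), int(top * height),
            int(right * width), int(bottom * height))


def ocr_roi(image, roi, lang):
    """Crop `image` (a NumPy array from OpenCV) to `roi` and OCR only the crop."""
    # Imported lazily so roi_to_pixels can be used without Tesseract installed.
    import pytesseract

    x0, y0, x1, y1 = roi_to_pixels(image.shape, roi)
    crop = image[y0:y1, x0:x1]  # NumPy slicing is rows (y) first, then columns (x)
    return pytesseract.image_to_string(crop, lang=lang).strip()


# Hypothetical fractions (left, top, right, bottom); measure them on your card.
NAME_ROI = (0.15, 0.08, 0.60, 0.25)  # assumed: name line near the top left
ID_ROI = (0.30, 0.80, 0.95, 0.95)    # assumed: ID number line along the bottom


def read_name_and_id(image):
    """Run OCR separately on the two assumed regions of a card image."""
    name = ocr_roi(image, NAME_ROI, lang="chi_sim")
    id_number = ocr_roi(image, ID_ROI, lang="eng")
    return name, id_number
```

With the thresholded image from above, this would be called as `read_name_and_id(th)`. Cropping first also means the address and date of birth never reach Tesseract, so there is nothing to filter out afterwards.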

Jeru Luke
  • i have used your code and i am able to extract the name on the image which is on the first line but still it doesn't extract the ID number which is on the last line of the card. it's very clear on the image but i don't know why it doesn't extract that..this is the output i am getting from this code "姓名` 费家杰…翼 叠沣瓢 男二 黾族汉 _ …′^出`…生`〉 翼叠g肝勇7 月斓亘 住址 上诲市宝山区泗塘七村93 '号503室"′ ′′二" – Tehseen Jul 12 '18 at 03:29
  • i have converted original image into gray scale and then applied dilation on that gray image and then find absolute difference and now the results are a bit improved. now i am getting the ID number but it's not satisfactory.. this is the output "性别 男〈 “ =) 黾族汉… ` _ _′ .…′′z′′ 「出 生`′ 「叠g′丐菩荠二]7′_眉菩卒垂′暮′日` 「` 住 址 上诲市宝山区泗塘七村腋 号503菖] ′…`】 …` `_ ′ ′` 毛 ′ 公民身份号码 '′′"31b『D9i991o蓁141011" – Tehseen Jul 12 '18 at 03:50
  • 1
    @Tehseen I think you have to tweak the dilation parameters a bit more, such as the type and size of the kernel. You could also try a median blur to remove the unwanted smaller spots (be careful when choosing the kernel size there as well). – Jeru Luke Jul 12 '18 at 07:28
  • 1
    I have updated the dilation code to "dilated_img = cv2.dilate(gray, np.ones((5, 5), np.uint8))" and "bg_img = cv2.medianBlur(dilated_img, 23)". It's better now, but there is still something wrong on the first line, and I only want to extract the name, which is the first line, and the ID number, which is the last line. This is the output I am getting now: 姓 名 费家加 __ 「`′' 性名u ′男… ' 民族汉 __ 出生 199壕年~7月童4日 住 址 上海市宝山区泗塘七村93 乙工乙道 ′ 公民身份号码 310109199107141011. Can you guide me on how to target a specific area to extract only the name and ID number? – Tehseen Jul 12 '18 at 08:40
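
If the whole card is OCR'ed as in the comments above, the name can also be picked out of the returned text instead of out of the image. A minimal sketch, assuming the printed label 姓名 survives OCR and is followed by the name; `extract_name` and `NAME_PATTERN` are hypothetical names, and the sample strings use a made-up placeholder name rather than real data.

```python
import re

# Label "姓名" (possibly split by spaces), then any non-CJK noise such as
# backticks or underscores, then 2 to 4 CJK characters taken as the name.
NAME_PATTERN = re.compile(r"姓\s*名[^\u4e00-\u9fff]*([\u4e00-\u9fff]{2,4})")


def extract_name(text):
    """Return the characters following the 姓名 label, or None if absent."""
    match = NAME_PATTERN.search(text)
    return match.group(1) if match else None


# Made-up OCR-like output for illustration only
sample = "姓 名 张三 __ 性别 男 民族汉"
print(extract_name(sample))
```

This is fragile by design: if Tesseract misreads the label itself, or glues the name to the next field without a separator, the pattern will miss or over-match, so cropping the region first is likely the more robust route.
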
0

In pytesseract, lang='chi_sim' tries to interpret the digits as Chinese characters as well. Use lang='eng' to get the numbers OCR'ed properly.
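
A minimal sketch of that suggestion, with some assumptions added: the ID number is taken to be an 18-character run of 17 digits followed by a digit or X, `--psm 7` (treat the image as a single text line) is assumed to suit a crop of just the number, and `find_id_number` and `ocr_id_number` are hypothetical helper names. The sample string in the usage line is made up.

```python
import re

# 17 digits plus a final digit or X, not embedded in a longer digit run
ID_PATTERN = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9])")


def find_id_number(text):
    """Return the first 18-character ID-like run in OCR text, or None."""
    compact = re.sub(r"\s+", "", text)  # Tesseract may insert spaces between digits
    match = ID_PATTERN.search(compact)
    return match.group(0).upper() if match else None


def ocr_id_number(image):
    """OCR an image (ideally cropped to the number) with the English model."""
    # Imported lazily so find_id_number can be used without Tesseract installed.
    import pytesseract

    raw = pytesseract.image_to_string(image, lang="eng", config="--psm 7")
    return find_id_number(raw)


# Made-up OCR-like output for illustration only
print(find_id_number("ID: 123456 78901234 567x"))
```

Because all whitespace is removed before matching, this works best on text from a crop of the number alone; on a full-card OCR result, neighbouring digits from other fields could merge with the number and prevent a match.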

SRK