Pytesseract returns nothing for Urdu and Arabic text

Question

I am converting an ID card image to text using Pytesseract. So far I have split the image into sections for the name, address, and ID card number, and I parse each section using:

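import pytesseract as tess
from PIL import Image

# Load the ID card image and crop one field: (left, upper, right, lower) in pixels
im = Image.open("Image.jpg")
crop_rectangle = (20, 320, 400, 400)
cropped_im = im.crop(crop_rectangle)

# Run OCR on the cropped field with the Arabic language data
text = tess.image_to_string(cropped_im, lang='ara')
print(text)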

The result is blank.
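Two things that seem worth ruling out for a blank result, offered as a sketch and not a confirmed fix: whether the requested language data is actually installed, and whether Tesseract's default page segmentation suits a small single-line crop. The helper names `missing_languages` and `ocr_single_line` are hypothetical. The sketch assumes a pytesseract version that provides `get_languages`, and that upscaling and `--psm 7` (treat the image as a single text line) are appropriate for this crop, which has not been verified against the actual image.

```python
def missing_languages(requested, installed):
    """Return the codes from a 'ara+urd' style string that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


def ocr_single_line(path, box, lang="ara"):
    """OCR one cropped field, treating it as a single line of text."""
    # Imported here so missing_languages() can be used without Tesseract installed.
    import pytesseract as tess
    from PIL import Image

    # Fail early if the language data for `lang` is not available to Tesseract.
    missing = missing_languages(lang, tess.get_languages(config=""))
    if missing:
        raise RuntimeError(f"No traineddata found for: {missing}")

    # Crop the field and convert it to grayscale.
    crop = Image.open(path).crop(box).convert("L")
    # Upscale 2x (assumption: the glyphs in the crop may be too small to recognize).
    crop = crop.resize((crop.width * 2, crop.height * 2), Image.LANCZOS)
    # --psm 7 asks Tesseract to treat the image as a single text line.
    return tess.image_to_string(crop, lang=lang, config="--psm 7")
```

With the crop from the question this would be called as `ocr_single_line("Image.jpg", (20, 320, 400, 400), lang="ara")`, or with `lang="urd"` for Urdu if that language data is installed.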

In addition, I also tried text = tess.image_to_pdf_or_hocr(cropped_im, lang='ara', extension='hocr')

This additional step returns:

b'<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"                
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name=\'ocr-system\' content=\'tesseract v5.0.0-alpha.20191030\' />  
<meta name=\'ocr-capabilities\' content=\'ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf\'/>
</head>
<body>  
<div class=\'ocr_page\' id=\'page_1\' title=\'image "C:\\Users\\MOHSIN~1.IFT\\AppData\\Local\\Temp\\tess_za20zk94.PNG"; bbox 0 0 380 80; ppageno 0\'>
<div class=\'ocr_carea\' id=\'block_1_1\' title="bbox 0 0 380 80">
<p class=\'ocr_par\' id=\'par_1_1\' lang=\'ara\' title="bbox 0 0 380 80">    
<span class=\'ocr_line\' id=\'line_1_1\' title="bbox 0 0 380 80; baseline 0 0; x_size 108; x_descenders 27; x_ascenders 27">
<span class=\'ocrx_word\' id=\'word_1_1\' title=\'bbox 0 0 380 80; x_wconf 95\'> </span>    
</span>
</p>   
</div>
</div>
</body>
</html>'
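The hOCR output contains a single `ocrx_word` span that covers the whole 380x80 crop and whose text is just a space. A minimal sketch for listing every word span with its bounding box and confidence follows, using only the standard library. `hocr_words` is a hypothetical helper name, and the sketch assumes the hOCR bytes are well-formed XHTML, as they appear to be here.

```python
import re
import xml.etree.ElementTree as ET


def hocr_words(hocr):
    """Return (text, bbox, confidence) for every ocrx_word span in hOCR output."""
    root = ET.fromstring(hocr)
    words = []
    for el in root.iter():
        # Word-level results are the elements with class="ocrx_word".
        if el.get("class") != "ocrx_word":
            continue
        # The title attribute looks like "bbox 0 0 380 80; x_wconf 95".
        title = el.get("title", "")
        bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
        conf = re.search(r"x_wconf (\d+)", title)
        words.append((
            "".join(el.itertext()).strip(),
            tuple(int(n) for n in bbox.groups()) if bbox else None,
            int(conf.group(1)) if conf else None,
        ))
    return words
```

Passing the bytes returned by `tess.image_to_pdf_or_hocr(cropped_im, lang='ara', extension='hocr')` to `hocr_words` makes it easy to see whether any span carries non-empty text.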

I need help converting an Urdu/Arabic image into text. Thank you in advance.

Hey @ProgSMI, were you able to solve this problem? I am facing the same issue. - Touqeer Aslam, Mar 02 '21 at 09:49


0 Answers