I am trying to convert an ID card image to text using pytesseract. So far I have split the image into sections for the name, address, and ID card number, and I parse each section using:
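import pytesseract as tess
from PIL import Image

im = Image.open("Image.jpg")
# Crop box is (left, upper, right, lower) in pixels: a 380x80 region
crop_rectangle = (20, 320, 400, 400)
cropped_im = im.crop(crop_rectangle)
# Run OCR on the cropped region with the Arabic language data
text = tess.image_to_string(cropped_im, lang='ara')
print(text)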
The result is blank.
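One possible cause I have not ruled out is image quality, since the crop is only 380x80 pixels. Below is a minimal sketch of a preprocessing step (grayscale, upscale, binarize) that could be applied before OCR. The function name `preprocess_for_ocr` and the `scale` and `threshold` defaults are hypothetical values chosen for illustration, not a confirmed fix for this image.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(im, scale=3, threshold=128):
    """Return a grayscale, upscaled, black-and-white copy of `im`.

    `scale` and `threshold` are illustrative defaults (assumptions);
    they usually need tuning per image.
    """
    # Drop colour information so only luminance is left
    gray = ImageOps.grayscale(im)
    # Enlarge the image so small glyphs cover more pixels
    width, height = gray.size
    enlarged = gray.resize((width * scale, height * scale), Image.LANCZOS)
    # Stretch the contrast, then map every pixel to pure black or white
    enlarged = ImageOps.autocontrast(enlarged)
    return enlarged.point(lambda p: 255 if p > threshold else 0)
```

The result could then be passed to the same call as before, for example `tess.image_to_string(preprocess_for_ocr(cropped_im), lang='ara', config='--psm 7')`, where `--psm 7` tells Tesseract to treat the image as a single line of text. For Urdu text, the `urd` language data may also be worth trying if it is installed.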
In addition, I have also tried:
text = tess.image_to_pdf_or_hocr(cropped_im, lang='ara', extension='hocr')
This additional step returns:
b'<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name=\'ocr-system\' content=\'tesseract v5.0.0-alpha.20191030\' />
<meta name=\'ocr-capabilities\' content=\'ocr_page ocr_carea ocr_par ocr_line ocrx_word
ocrp_wconf\'/>
</head>
<body>
<div class=\'ocr_page\' id=\'page_1\' title=\'image
"C:\\Users\\MOHSIN~1.IFT\\AppData\\Local\\Temp\\tess_za20zk94.PNG"; bbox 0 0 380 80; ppageno 0\'>
<div class=\'ocr_carea\' id=\'block_1_1\' title="bbox 0 0 380 80">
<p class=\'ocr_par\' id=\'par_1_1\' lang=\'ara\' title="bbox 0 0 380 80">
<span class=\'ocr_line\' id=\'line_1_1\' title="bbox 0 0 380 80; baseline 0 0; x_size 108;
x_descenders 27; x_ascenders 27">
<span class=\'ocrx_word\' id=\'word_1_1\' title=\'bbox 0 0 380 80; x_wconf 95\'> </span>
</span>
</p>
</div>
</div>
</body>
</html>'
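The hOCR above contains a single `ocrx_word` whose content is only a space, even though its reported confidence is 95. To check this programmatically, here is a minimal sketch that pulls every word and its `x_wconf` value out of the hOCR bytes using only the standard library. The names `HocrWords` and `hocr_words` are hypothetical helpers, not part of pytesseract.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, confidence) pairs from ocrx_word spans in hOCR."""

    def __init__(self):
        super().__init__()
        self.words = []        # list of (text, confidence) tuples
        self._in_word = False  # True while inside an ocrx_word span
        self._conf = -1        # -1 means no x_wconf was found
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "span" and attr.get("class") == "ocrx_word":
            # The confidence lives in the title, e.g. "bbox ...; x_wconf 95"
            match = re.search(r"x_wconf\s+(\d+)", attr.get("title") or "")
            self._conf = int(match.group(1)) if match else -1
            self._in_word = True
            self._buf = []

    def handle_data(self, data):
        if self._in_word:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._buf).strip(), self._conf))
            self._in_word = False


def hocr_words(hocr_bytes):
    """Decode hOCR bytes and return the recognized words with confidences."""
    parser = HocrWords()
    parser.feed(hocr_bytes.decode("utf-8"))
    parser.close()
    return parser.words
```

Calling `hocr_words(text)` on the bytes returned by `image_to_pdf_or_hocr` gives a list of `(word, confidence)` tuples; an empty string in the first position means Tesseract found a word box but recognized no characters in it.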
I need help converting an Urdu/Arabic image into text. Thank you in advance.