Extracting Persian/Farsi text from an image in Python

Question

I have a problem extracting Persian text from an image in Python. This is my code:

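import cv2
import pytesseract

# Point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread('kkk.jpg')
# No language is given, so Tesseract uses its default (English) model
text = pytesseract.image_to_string(img)
read = text.encode('utf-8')
print(read)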

And the output:

b'ae Ose\nDs EO gr ga HHO seme Oe TY ID OAM mE GRA CC 0\noP st joa nt RE pene op A er pr ee <2\nSr Bho OF 0 6M er erg AE Ce AD or70 e100 ETT NAO =P 16\ney AO ER EO a ome Shes 1g ee\nwido Aire 8 Ore sO er Cory |? re pre CO ee\nAD LH pre Dae heyy PACD He sy oy oe OO sie Aion\xe2\x80\x9d\nDB ep SD Sop 6 FD Hoy (CAD ASS ep gets (Se EO ET\nay gt OCs CO map) ramnee (FETE SIO wre a OTe IO\nry pO Se Forge ye ag PLEA HG gelg0 9 erie re OG\nCP nO ser Fine? LA Peter (9007 \xe2\x80\x98SHG \xc2\xa5 [BIS) SI SY\nifs pre A (JK So ey pe? g1005 [HO 2 IC QC\nqe ID Sr VET pra? Aire Marto ery mrewy rem geod \xc2\xa9\n\nseq, PF em ae'

You're going to have to add an example image to your question if you want any concrete help. Did you try specifying the Persian language? See https://nanonets.com/blog/ocr-with-tesseract/ for some examples of how to specify the language. - DisappointedByUnaccountableMod, Jul 08 '20 at 23:02
Also see https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy#10034214 which in turn links to https://tesseract-ocr.github.io/tessdoc/ImproveQuality - DisappointedByUnaccountableMod, Jul 08 '20 at 23:05
Thank you for replying. I'll try it tomorrow; it should probably solve my problem :) - SerioUs, Jul 08 '20 at 23:50

score 2 · Answer 1 · answered Nov 03 '20 at 07:54

You must download the trained Farsi data for Tesseract (the file fas.traineddata) from the Tesseract project and put it in the tessdata folder of your Tesseract installation. Then use the following code:

text = pytesseract.image_to_string(img, lang='fas')
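
A slightly fuller sketch of the same call, offered as one possible way to wire it together rather than as the answer's exact method. It assumes fas.traineddata is already installed; the helper names `ocr_farsi` and `save_text` are hypothetical. It also keeps the result as a `str` instead of calling `.encode('utf-8')`, since printing the encoded bytes shows escape sequences rather than Persian letters.

```python
from pathlib import Path


def ocr_farsi(image_path, tesseract_cmd=None):
    """Run Tesseract with the Farsi model on an image and return a str.

    Hypothetical helper. The OCR libraries are imported here so that
    save_text() below can be used without them installed.
    """
    import cv2
    import pytesseract

    if tesseract_cmd:
        # e.g. r'C:\Program Files\Tesseract-OCR\tesseract.exe' on Windows
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    img = cv2.imread(str(image_path))
    if img is None:
        # cv2.imread returns None instead of raising when the file is unreadable
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # OpenCV loads images as BGR; convert to RGB before handing to Tesseract
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, lang="fas")


def save_text(text, out_path):
    """Write the recognized text to a UTF-8 file and return its path."""
    path = Path(out_path)
    path.write_text(text, encoding="utf-8")
    return path


# Example usage (needs Tesseract, fas.traineddata, and the image file):
# text = ocr_farsi("kkk.jpg")
# save_text(text, "kkk.txt")
```

Writing to a UTF-8 file is a cautious choice here: some Windows consoles cannot display Persian characters, so `print(text)` may fail or show placeholders even when the OCR result is correct.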

Ehsan Akbaritabar

Failed loading language 'fas'. Tesseract couldn't load any languages! Could not initialize tesseract. – ashkan Feb 28 '23 at 07:32
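
An error like this usually means Tesseract did not find fas.traineddata in the tessdata folder it searched. The following is a minimal diagnostic sketch, not a confirmed fix for the commenter's setup: the helper names `installed_languages` and `tessdata_config` are hypothetical, and the folder path in the usage comment is an assumed default Windows location.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """List the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def tessdata_config(tessdata_dir):
    """Build a config string that points Tesseract at a specific tessdata folder."""
    return f'--tessdata-dir "{tessdata_dir}"'


# Example usage (the path is an assumption; use your own install location):
# tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
# print(installed_languages(tessdata))   # 'fas' should appear in this list
# text = pytesseract.image_to_string(img, lang="fas",
#                                    config=tessdata_config(tessdata))
```

If 'fas' is missing from the list, the file was not copied to that folder; if it is present but the error persists, Tesseract is probably reading a different tessdata folder, which passing `--tessdata-dir` explicitly (or setting the `TESSDATA_PREFIX` environment variable) can help rule out.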
