1

I have a problem extracting Persian text from an image in Python. This is my code:

import cv2
import pytesseract

# Raw string: single backslashes are enough here.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread('kkk.jpg')
text = pytesseract.image_to_string(img)
read = text.encode('utf-8')
print(read)

And the output:

b'ae Ose\nDs EO gr ga HHO seme Oe TY ID OAM mE GRA CC 0\noP st joa nt RE pene op A er pr ee <2\nSr Bho OF 0 6M er erg AE Ce AD or70 e100 ETT NAO =P 16\ney AO ER EO a ome Shes 1g ee\nwido Aire 8 Ore sO er Cory |? re pre CO ee\nAD LH pre Dae heyy PACD He sy oy oe OO sie Aion\xe2\x80\x9d\nDB ep SD Sop 6 FD Hoy (CAD ASS ep gets (Se EO ET\nay gt OCs CO map) ramnee (FETE SIO wre a OTe IO\nry pO Se Forge ye ag PLEA HG gelg0 9 erie re OG\nCP nO ser Fine? LA Peter (9007 \xe2\x80\x98SHG \xc2\xa5 [BIS) SI SY\nifs pre A (JK So ey pe? g1005 [HO 2 IC QC\nqe ID Sr VET pra? Aire Marto ery mrewy rem geod \xc2\xa9\n\nseq, PF em ae'
SerioUs
  • You're going to have to add an example image to your question if you want to get any concrete help. Did you try specifying the Persian language? See here for some examples of how to specify localization: https://nanonets.com/blog/ocr-with-tesseract/ – DisappointedByUnaccountableMod Jul 08 '20 at 23:02
  • Also see https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy#10034214 and that links to https://tesseract-ocr.github.io/tessdoc/ImproveQuality – DisappointedByUnaccountableMod Jul 08 '20 at 23:05
  • Thank you for replying. I'll try it tomorrow; it should probably solve my problem :) – SerioUs Jul 08 '20 at 23:50

1 Answer

2

You must download the trained Farsi data for Tesseract (the file fas.traineddata) from the Tesseract project and put it in the tessdata folder inside your Tesseract installation path. Then pass the language code when calling pytesseract:

text = pytesseract.image_to_string(img, lang='fas')
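
If Tesseract cannot find the language file, the call above fails to initialize, so it may help to check for the file first. The following is only a minimal sketch: the helper names has_language and ocr_persian are hypothetical, and it assumes the tessdata folder you pass in is the one your Tesseract installation actually reads (a TESSDATA_PREFIX environment variable can point it elsewhere). It also assumes tesseract is on PATH or that tesseract_cmd has been set as in the question.

```python
from pathlib import Path


def has_language(tessdata_dir, lang="fas"):
    """Return True if <lang>.traineddata exists in the given tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


def ocr_persian(image_path, tessdata_dir, lang="fas"):
    """Run OCR with the given language after checking the data file exists."""
    # Imported lazily so has_language() works without the OCR libraries.
    import cv2
    import pytesseract

    if not has_language(tessdata_dir, lang):
        raise FileNotFoundError(
            f"{lang}.traineddata not found in {tessdata_dir}"
        )

    img = cv2.imread(str(image_path))
    if img is None:
        # cv2.imread returns None instead of raising when loading fails.
        raise ValueError(f"could not read image: {image_path}")

    # OpenCV loads images as BGR; convert to RGB before handing to pytesseract.
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Returns a normal Python str; there is no need to encode it to bytes.
    return pytesseract.image_to_string(rgb, lang=lang)
```

For example, ocr_persian('kkk.jpg', r'C:\Program Files\Tesseract-OCR\tessdata') would return the recognized text as a str, assuming that path matches your installation. Printing the str directly (or writing it to a file opened with encoding='utf-8') avoids the b'...' byte output shown in the question.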
  • Failed loading language 'fas' Tesseract couldn't load any languages! Could not initialize tesseract. – ashkan Feb 28 '23 at 07:32