Tesseract-OCR text extraction

Question

I am trying to OCR an ID card image with Tesseract. Since I am new to this field, I don't know much about image preprocessing. This is what I have done so far, but the output is not good.

This is the original image:

This is the code I have tried so far:

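import cv2 as cv

# Load the ID image as grayscale (flag 0)
img = cv.imread('id/ID (6).jpg', 0)
# Estimate the background with a heavy blur, then divide it out to flatten the lighting
smooth = cv.GaussianBlur(img, (0, 0), sigmaX=30, sigmaY=30)
division = cv.divide(img, smooth, scale=255)
# Clip values above 220 (maxval is ignored by THRESH_TRUNC), then binarize locally
ret, thresh3 = cv.threshold(division, 220, 150, cv.THRESH_TRUNC)
adaptive = cv.adaptiveThreshold(thresh3, 255, cv.ADAPTIVE_THRESH_GAUSSIAN_C, cv.THRESH_BINARY, 11, 2)
# Erode with a 3x3 kernel, which thickens the dark strokes on the white background
kernel = cv.getStructuringElement(cv.MORPH_RECT, (3, 3))
morpho_e = cv.morphologyEx(adaptive, cv.MORPH_ERODE, kernel, iterations=1)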

This is the output image I am getting:

For Tesseract OCR:

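import pytesseract as py

# Point pytesseract at the Tesseract executable (Windows install path)
py.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
# LSTM engine, Greek language data; --psm takes a single page segmentation mode
config_param = r'--oem 1 -l ell --psm 6'
string = py.image_to_string(morpho_e, config=config_param)
print(string)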

OUTPUT (the text I am getting):

ΣΤΟΙΧΕΙΑ ΤΑΥΤΟΤΗΤΑΣ Ξ
ΑΙΘΟΞΟΠΟΥΛΩΣ ο.
[θηδίθροικος αδΛ
ἀδίόοςν ο φδ
ΤΗΡΟΡΟΒΟΣ
ΩΟΙνΕΝ ΝΗΣ ]
ΙΟΑΝΗΜΙς
-ςτ-τ--.
ἈθδΟΗ Λο α΄
ὀπο ον
ΜΑΡΙΑ
ΜΑΤΘ ΜΗΤΕΗΝ γ
ΘΕΣΣΑΛΟΝΙΚΗ ΘΕ ΠΑΤΕ ΟΝϊΚΗΣ δἳ
τοΠοςΕΓΕΕΗΗΗΣΗς..ὸ.ΎΥνς-
τὸ ΝΡῑ
ΘΕΣΣΑΛΟΝΙΚΗΣ 177992/9 |
ΔΑ. ΤΟΥΜΠΑΣ -ΤΡΙΑΝΔΡΙΑΣ ''

- " ων Ξη.. , : ν. :
ΔΑΝΑΗ ή αυρούλα
ἠ Οτο ΥπΟΞΣ ΑΟἳ ϱ Δεῖγὲ ! )

δ” “'.... ν΄

ψἂ (ΥΠΟΓΡΑΦΗ - ΣΦΡΑΓΙΔΑ)

αν... ο, ) Ρς ο  πς, εν

Could someone kindly help me, or give me some guidance on how to tackle this problem?

Assuming those blue words are not necessary, I think you should try to erase them from your image, because Tesseract tries to process everything it sees. This post might help: https://stackoverflow.com/questions/42592234/python-opencv-morphologyex-remove-specific-color - Val77, Jan 16 '23 at 08:07
@Val77 Thanks for the reply. I have tried that too, but the result is still not good. - Muhammad Hamza, Jan 16 '23 at 13:47
