Pytesseract Improve OCR Accuracy

Question

I want to extract the text from an image in python. In order to do that, I have chosen pytesseract. When I tried extracting the text from the image, the results weren't satisfactory. I also went through this and implemented all the techniques listed down. Yet, it doesn't seem to perform well.

Image:

Code:

import pytesseract
import cv2
import numpy as np

img = cv2.imread('D:\\wordsimg.png')

img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)

img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
    
txt = pytesseract.image_to_string(img ,lang = 'eng')

txt = txt[:-1]

txt = txt.replace('\n',' ')

print(txt)

Output:

t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was

Even 1 unwanted space could cost me a lot. I want the results to be 100% accurate. Any help would be appreciated. Thanks!

bfris · Accepted Answer · 2020-10-06T15:52:23.100

I changed resize from 1.2 to 2 and removed all preprocessing. I got good results with psm 11 and psm 12

import pytesseract
import cv2
import numpy as np

img = cv2.imread('wavy.png')

#  img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.resize(img, None, fx=2, fy=2)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
#  img = cv2.dilate(img, kernel, iterations=1)
#  img = cv2.erode(img, kernel, iterations=1)

#  img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

cv2.imwrite('thresh.png', img)

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
    
for psm in range(6,13+1):
    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, config = config, lang='eng')
    print('psm ', psm, ':',txt)

The config = '--oem 3 --psm %d' % psm line uses the string interpolation (%) operator to replace %d with an integer (psm). I'm not exactly sure what oem does, but I've gotten in the habit of using it. More on psm at the end of this answer.

psm  11 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was

psm  12 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was

psm is short for page segmentation mode. I'm not exactly sure what the different modes are. You can get a feel for what the codes are from the descriptions. You can get the list from tesseract --help-psm

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

Cool! A small favor. Could u pls explain what psm is? And what does ```config = '--oem 3 --psm %d' % psm``` mean? — Sushil, Oct 06 '20 at 03:44
If u feel that my question is good and is well-framed, then pls consider upvoting my question. Thank you! — Sushil, Oct 09 '20 at 13:55
oem is engine mode, `--oem 3` runs it on defaults. If you run your tesseract executable with `--help` you can see a full list of options. [this page](https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data) was quite helpful too. — Edward Spencer, Apr 12 '22 at 16:51

Pytesseract Improve OCR Accuracy

1 Answers1

Linked