I've just started exploring pytesseract. Here's the issue I'm facing:
I have the following input image.
Now, when I try running OCR on this, I get the following output:
Thanks for signing up. Now you too can pick your favorite pillows
Option AB
After trying it out on multiple image samples, I can safely conclude the following:
- It is not caused by a penalty on non-dictionary words.
- It is omitting short words, almost as though a minimum width is in effect and any line narrower than that is being vetoed.
- It only happens if the input image has the bounding rectangle around it. If I remove that from the input image I get the correct output.
That is, on the following image:
I get the following output:
Thanks for signing up. Now you too can pick your favorite pillows
Option
Opon
Option AB
I'm unable to figure out where I am going wrong. Here's the code I'm using:
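from PIL import Image
import pytesseract
import cv2
# Load the image and convert it to grayscale.
image = cv2.imread('testImage.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Binarize: pixels brighter than 240 become white (255), everything else black (0).
gray = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY)[1]
# Save the preprocessed image, then reload it with PIL for Tesseract.
filename = "intermediate.png"
cv2.imwrite(filename, gray)
im = Image.open(filename)
text = pytesseract.image_to_string(im)
print(text)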
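One way to probe the minimum-width hypothesis is to look at the per-word boxes that Tesseract does report. The sketch below assumes a pytesseract/Tesseract combination that supports `image_to_data` (Tesseract 3.05 or later); the helper names `summarize_words` and `ocr_word_report` are hypothetical and only for illustration.

```python
def summarize_words(data):
    """Return (text, confidence, width) for each non-empty word.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    rows = []
    for text, conf, width in zip(data["text"], data["conf"], data["width"]):
        if text.strip():  # skip the empty entries used for blocks and lines
            rows.append((text, float(conf), int(width)))
    return rows


def ocr_word_report(path):
    """Run OCR on the image at `path` and summarize the recognized words."""
    # Imported lazily so summarize_words can be used without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(path) as im:
        data = pytesseract.image_to_data(im, output_type=pytesseract.Output.DICT)
    return summarize_words(data)
```

Calling `ocr_word_report("intermediate.png")` on the images with and without the rectangle would list the widths of the words that survive, which may help show whether narrow words are the ones being dropped.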
I've tried tinkering around with some of the config parameters (crunch_del_min_width, language_model_min_compound_length, and a few others) too, but nothing helped.
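For reference, here is a minimal sketch of how such parameters can be passed through the `config` argument of `pytesseract.image_to_string`, using Tesseract's `-c name=value` syntax. The helper names `build_config` and `ocr_with_params` are hypothetical, and the values shown are illustrative assumptions rather than the exact settings tried.

```python
def build_config(params):
    """Build a Tesseract config string of '-c name=value' overrides."""
    return " ".join(f"-c {name}={value}" for name, value in params.items())


def ocr_with_params(path, params):
    """Run OCR on the image at `path` with the given variable overrides."""
    # Imported lazily so build_config can be used without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(path) as im:
        return pytesseract.image_to_string(im, config=build_config(params))


if __name__ == "__main__":
    # Illustrative values only; not a recommendation.
    overrides = {
        "crunch_del_min_width": 1,
        "language_model_min_compound_length": 1,
    }
    print(build_config(overrides))
```

With overrides like these, a call such as `ocr_with_params("intermediate.png", overrides)` produced the same truncated output as the default settings.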