Tesseract returning gibberish when performing OCR on image

Question

I'm trying to use Tesseract to read an image, but it returns gibberish. I know I need to do some pre-processing, but what I have found online doesn't seem to work with my image. I tried this answer to turn the picture from black background/white letters to white background/black letters without success.

This is the picture.

And my simple code:

from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'D:\Tesseract-OCR\tesseract'

img = Image.open("2020-01-25_17-57-49_UTC.jpg")
print(pytesseract.image_to_string(img))

score 2 · Answer 1 · answered Jan 28 '20 at 01:38

Cobbling together code found here on SO:

from PIL import Image
import PIL.ImageOps
import pytesseract

img = Image.open("8pjs0.jpg")
inverted_image = PIL.ImageOps.invert(img)
print(pytesseract.image_to_string(inverted_image))

gives me

Dolar Hoy en Cucuta

25-Enero-20
01:00PM

78.048
VENTA

I think you'll need a language pack for the accented characters.

Perhaps something in https://github.com/tesseract-ocr/tessdata can help with the accented characters. – Jan 28 '20 at 08:42
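
One way to act on the language-pack suggestion is to pass a language code to `image_to_string`. The following is a minimal sketch, not a tested fix for this image: it assumes the Spanish model (`spa.traineddata`) is installed in your tessdata directory, and `ocr_spanish` is a hypothetical helper name.

```python
def ocr_spanish(path):
    """Invert the image at `path` and OCR it with the Spanish model."""
    # Imported here so a missing Pillow/pytesseract install only fails
    # when OCR is actually requested.
    from PIL import Image, ImageOps
    import pytesseract

    # Convert to RGB first: ImageOps.invert does not accept every image mode.
    img = Image.open(path).convert("RGB")
    inverted = ImageOps.invert(img)

    # lang="spa" is an assumption: it requires spa.traineddata to be installed.
    return pytesseract.image_to_string(inverted, lang="spa")
```

Calling `print(ocr_spanish("8pjs0.jpg"))` would then run the same inversion as above with the Spanish model. Running `tesseract --list-langs` on the command line shows which languages your installation actually has.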

score 1 · Answer 2 · answered Jan 28 '20 at 01:38

A simple Otsu's threshold to obtain a binary image, followed by an inversion to get black letters on a white background, seems to work. We use --psm 3 to tell Tesseract to perform fully automatic page segmentation (this is also its default mode). Take a look at Pytesseract OCR multiple config options for more configuration options. Here's the preprocessed image

Result from Pytesseract OCR

Dolar Hoy en Cucuta

25-Enero-20
01:00PM

78.048
VENTA

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, grayscale, threshold, invert
image = cv2.imread('1.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
result = 255 - thresh

# Perform OCR with Pytesseract
data = pytesseract.image_to_string(result, config='--psm 3')
print(data)

cv2.imshow('result', result)
cv2.waitKey()
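
To make the thresholding step less of a black box, here is a minimal pure-Python sketch of what Otsu's method computes: it picks the grey level that maximizes the between-class variance of the two resulting pixel groups. This is an illustration only, not OpenCV's implementation; `otsu_threshold` and `binarize_and_invert` are hypothetical names, and the input is assumed to be a flat list of 8-bit grey values.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    total_sum = sum(value * count for value, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    count_lo = 0  # number of pixels with value <= t
    sum_lo = 0    # sum of those pixel values
    for t in range(256):
        count_lo += hist[t]
        sum_lo += t * hist[t]
        count_hi = total - count_lo
        if count_lo == 0 or count_hi == 0:
            continue  # one class is empty, so there is no split to score
        mean_lo = sum_lo / count_lo
        mean_hi = (total_sum - sum_lo) / count_hi
        # Proportional to the between-class variance (constant factor dropped).
        score = count_lo * count_hi * (mean_lo - mean_hi) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize_and_invert(pixels):
    """Threshold with Otsu, then invert, like THRESH_BINARY followed by 255 - thresh."""
    t = otsu_threshold(pixels)
    # Bright pixels (the letters) become 0, everything else becomes 255.
    return [0 if p > t else 255 for p in pixels]
```

For a dark background with bright letters, such as `[10, 12, 11, 13, 240, 250, 245]`, the bright values end up as 0 (black) and the dark ones as 255 (white), which is the dark-text-on-light-background polarity that Tesseract generally handles better.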
