I want to extract the text from an image in python
. In order to do that, I have chosen pytesseract
. When I tried extracting the text from the image, the results weren't satisfactory. I also went through this and implemented all the techniques listed down. Yet, it doesn't seem to perform well.
Image:
Code:
import pytesseract
import cv2
import numpy as np
img = cv2.imread('D:\\wordsimg.png')
img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
txt = pytesseract.image_to_string(img ,lang = 'eng')
txt = txt[:-1]
txt = txt.replace('\n',' ')
print(txt)
Output:
t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was
Even 1 unwanted space could cost me a lot. I want the results to be 100% accurate. Any help would be appreciated. Thanks!