Have a look at the page segmentation modes of pytesseract, cf. this Q&A. For example, using config='-psm 12' will already give you all the desired texts (note that newer Tesseract releases expect the flag to be spelled '--psm'). However, the graphs are also partly interpreted as text.
That's why I would preprocess the image to get single boxes (the actual texts, the graphs, the information at the top, etc.), and filter them to keep only the boxes with the content of interest. That could be done by using:
- the y coordinate of the bounding rectangle (not in the upper 5 % of the image, that's the mobile phone status bar),
- the width w of the bounding rectangle (not wider than 50 % of the image's width, these are the horizontal lines),
- the x coordinate of the bounding rectangle (not in the middle third of the image, these are the graphs).
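The three rules above can be sketched as a small standalone predicate. This is only an illustration: the name `keep_rect`, its parameters, and the sample boxes are made up here, and the rectangles are assumed to be `(x, y, w, h)` tuples as returned by `cv2.boundingRect`.

```python
def keep_rect(rect, img_w, img_h, top_frac=0.05, max_w_frac=0.5):
    """Return True if an (x, y, w, h) box passes all three filter rules.

    Hypothetical helper; the default thresholds mirror the list above.
    """
    x, y, w, h = rect
    below_status_bar = y > top_frac * img_h               # not in the upper 5 %
    not_a_line = w <= max_w_frac * img_w                  # not a horizontal line
    outside_middle = x < img_w / 3 or x > 2 * img_w / 3   # not a graph
    return below_status_bar and not_a_line and outside_middle


# Made-up boxes for a 900 x 1600 px screenshot
boxes = [
    (20, 10, 100, 30),    # status bar
    (0, 400, 900, 2),     # horizontal line
    (350, 500, 200, 80),  # graph in the middle third
    (40, 500, 120, 40),   # text on the left
    (700, 500, 150, 40),  # text on the right
]
kept = [b for b in boxes if keep_rect(b, img_w=900, img_h=1600)]
print(kept)
```

Only the last two boxes survive the filter. Keeping the thresholds as parameters makes it easier to adapt the rules to screenshots with a different layout.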
What's left is to run pytesseract on each cropped image, for example with config='-psm 6' (assume a single uniform block of text), and to strip any line breaks from the resulting texts.
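The cropping and clean-up around the OCR call could be sketched with two small helpers. Both names (`clamp_crop`, `clean_text`) are hypothetical, and two details are assumptions rather than part of the approach described above: the padded crop is clamped to the image borders, because a negative slice start would wrap around in NumPy and give an empty crop, and whitespace is collapsed to single spaces instead of being removed. Some Tesseract versions may also append a trailing form feed, which this clean-up drops as well.

```python
def clamp_crop(x, y, w, h, img_w, img_h, pad=1):
    """Return (y0, y1, x0, x1) slice bounds, padded by `pad` px and
    clamped so they never leave the image (hypothetical helper)."""
    x0 = max(x - pad, 0)
    y0 = max(y - pad, 0)
    x1 = min(x + w + pad, img_w)
    y1 = min(y + h + pad, img_h)
    return y0, y1, x0, x1


def clean_text(raw):
    """Collapse line breaks and other whitespace runs into single spaces."""
    return ' '.join(raw.split())


# A box touching the left border: x - 1 would be -1 without clamping
bounds = clamp_crop(0, 100, 50, 20, img_w=900, img_h=1600)
cleaned = clean_text('500.46\nshares\n\x0c')
print(bounds)
print(cleaned)
```

The returned bounds would be used as `img_grey[y0:y1, x0:x1]` before passing the crop to `pytesseract.image_to_string`.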
That'd be my code:
import cv2
import pytesseract
# Read image
img = cv2.imread('cUcby.png')
hi, wi = img.shape[:2]
# Convert to grayscale for Tesseract
img_grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Mask single boxes by thresholding and morphological closing in x direction
mask = cv2.threshold(img_grey, 248, 255, cv2.THRESH_BINARY_INV)[1]
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE,
                        cv2.getStructuringElement(cv2.MORPH_RECT, (51, 1)))
# Find contours w.r.t. the OpenCV version
cnts = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
# Get bounding rectangles
rects = [cv2.boundingRect(cnt) for cnt in cnts]
# Filter bounding rectangles:
# - not in the upper 5 % of the image (mobile phone status bar)
# - not wider than 50 % of the image's width (horizontal lines)
# - not in the middle third of the image (graphs)
rects = [(x, y, w, h) for x, y, w, h in rects if
         (y > 0.05 * hi) and
         (w <= 0.5 * wi) and
         ((x < 0.3333 * wi) or (x > 0.6666 * wi))]
# Sort bounding rectangles first by y coordinate, then by x coordinate
rects = sorted(rects, key=lambda x: (x[1], x[0]))
# Get texts from bounding rectangles from pytesseract
# (the 1 px padding is clamped at 0, so a box at x = 0 doesn't give an empty crop)
texts = [pytesseract.image_to_string(
    img_grey[max(y-1, 0):y+h+1, max(x-1, 0):x+w+1], config='-psm 6')
    for x, y, w, h in rects]
# Remove line breaks
texts = [text.replace('\n', '') for text in texts]
# Output
print(texts)
And that's the output:
['Investing', '$9,712.99', 'ASRT', '-27.64%', '500.46 shares', 'GNUS', '-27.98%', '251.69 shares']
Since you have the locations of the bounding rectangles, you could also re-arrange the whole text using that information.
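One possible way to re-arrange the texts is to group boxes with similar y coordinates into rows. The following is a minimal sketch under stated assumptions: the function `group_rows`, the tolerance of 10 px, the `' | '` separator, and all coordinates in the sample data are made up for illustration; only the texts are taken from the output above.

```python
def group_rows(items, tol=10):
    """Group (x, y, w, h, text) items into rows.

    An item joins the current row if its y coordinate is within `tol`
    pixels of the first item in that row (hypothetical helper).
    """
    rows = []
    for item in sorted(items, key=lambda it: (it[1], it[0])):
        if rows and abs(item[1] - rows[-1][0][1]) <= tol:
            rows[-1].append(item)
        else:
            rows.append([item])
    # Within each row, order the texts from left to right
    return [' | '.join(it[4] for it in sorted(row, key=lambda it: it[0]))
            for row in rows]


# Made-up positions; texts as in the output above
items = [
    (40, 300, 80, 30, 'ASRT'),
    (700, 304, 120, 30, '-27.64%'),
    (40, 340, 150, 25, '500.46 shares'),
    (40, 500, 80, 30, 'GNUS'),
    (700, 503, 120, 30, '-27.98%'),
    (40, 540, 150, 25, '251.69 shares'),
]
lines = group_rows(items)
print('\n'.join(lines))
```

With these sample positions, each ticker ends up on the same line as its percentage, followed by a line with the share count.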
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.16299-SP0
Python: 3.9.1
PyCharm: 2021.1.1
OpenCV: 4.5.1
pytesseract: 4.00.00alpha
----------------------------------------