I have a huge number of JPEG images which are in high resolution (2500 x 3500 pixels) and are roughly in this shape:

enter image description here

Each number marks the start of a separate record, and my aim is to convert these records to text.

I am aware of tools such as OpenCV and Tesseract that can be used for OCR, but my problem is detecting the boundary of each record, so that I can later feed each record to the OCR separately. How can I achieve something like this:

enter image description here

wiki
  • Look at [answer](https://stackoverflow.com/a/59977588/4267439). Or maybe you can feed everything to the OCR and then separate the records using the numbers and pipes with a regex. – rok Mar 04 '21 at 09:18
  • @rok thanks man; I'll check both options. – wiki Mar 04 '21 at 09:21
  • Does every record start with a blue number? Threshold on blue and do some morphological closing, to get "blue boxes". From that, create actual boundaries from the top of each "blue box" to the top of the next "blue box" (+/- a few pixels to the top or bottom), and incorporating the whole width. – HansHirse Mar 04 '21 at 10:04
  • @HansHirse Yes, every record starts with a blue number and a blue pipe symbol. I was thinking of the same strategy you've suggested, but did not know how to implement it. Thanks. – wiki Mar 04 '21 at 11:11
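
The OCR-first route mentioned in the comments (run OCR on the whole page, then split the text with a regex) could look roughly like the sketch below. It is only an illustration under assumptions: `split_records` and `RECORD_START` are hypothetical names, and the pattern assumes the OCR output keeps each record's leading number and pipe at the start of a line. Since the blue color is lost in plain text, a body line that happens to start with digits and a pipe would be split off as well.

```python
import re

# Assumed layout: a record starts on a line that begins with digits followed by a pipe
RECORD_START = re.compile(r"^[ \t]*\d+[ \t]*\|", flags=re.MULTILINE)


def split_records(ocr_text):
    """Split OCR output into records at every line that starts with 'number |'."""
    starts = [m.start() for m in RECORD_START.finditer(ocr_text)]
    # Each record runs until the next record start (or the end of the text)
    ends = starts[1:] + [len(ocr_text)]
    return [ocr_text[s:e].strip() for s, e in zip(starts, ends)]
```

Any text before the first matching line is dropped, and a number followed by a pipe in the middle of a line is not treated as a record start.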

1 Answer

Since every record starts with a blue number, you can threshold on blue-ish colors in the HSV color space to get a mask of that text. Apply a morphological closing to the mask to turn the blue text into solid "boxes". Then find the contours in the modified mask and take the top y coordinate of each one. Finally, extract the individual records from the original image by slicing from one y coordinate to the next (+/- a few pixels) over the full image width.

Here's some code for that:

import cv2
import numpy as np

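# Read image (cv2.imread returns None if the file cannot be read)
img = cv2.imread('CfOBO.png')
if img is None:
    raise FileNotFoundError('Could not read CfOBO.png')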

# Thresholding blue-ish colors using HSV color space
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
blue_lower = (90, 128, 64)
blue_upper = (135, 255, 192)
blue_mask = cv2.inRange(hsv, blue_lower, blue_upper)

# Morphological closing
blue_mask = cv2.morphologyEx(blue_mask, cv2.MORPH_CLOSE, np.ones((11, 11)))

# Find contours w.r.t. the OpenCV version
cnts = cv2.findContours(blue_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Get y coordinate from bounding rectangle for each contour
y = sorted([cv2.boundingRect(cnt)[1] for cnt in cnts])

# Manually add end of last record
y.append(img.shape[0])

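# Extract records; clamp the start so a record near the top edge
# does not produce a negative (wrap-around) index
records = [img[max(y[i] - 5, 0):y[i + 1] - 5, ...] for i in range(len(y) - 1)]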

# Show records
for record in records:
    cv2.imshow('Record', record)
    cv2.waitKey(0)
cv2.destroyAllWindows()

There's plenty of room for optimization, e.g. when the last record is followed by a large white space: I simply used the image bottom as the lower end of the last record. Still, the general workflow should do what you want. (I left out the subsequent pytesseract step.)
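
Two such refinements could be sketched as follows. This is a minimal illustration and not part of the code above: `merge_close_rows` and `last_content_row` are hypothetical helpers, and the `min_gap` and `white_thresh` defaults are guesses that would need tuning on the real scans. The first collapses y coordinates that lie almost on the same line, which can happen if the blue number and the blue pipe end up as separate contours. The second finds the last row that contains any non-white pixel, as an alternative to using the image bottom.

```python
import numpy as np


def merge_close_rows(ys, min_gap=10):
    """Keep only the smallest y of each group of values closer than min_gap pixels."""
    merged = []
    for y in sorted(ys):
        if not merged or y - merged[-1] >= min_gap:
            merged.append(y)
    return merged


def last_content_row(img, white_thresh=240):
    """Return the index one past the last row that contains a non-white pixel."""
    # For color images a pixel counts as white only if all channels are bright
    gray = img if img.ndim == 2 else img.min(axis=2)
    rows = np.where((gray < white_thresh).any(axis=1))[0]
    return int(rows[-1]) + 1 if rows.size else img.shape[0]
```

With these, `y = merge_close_rows(y)` would be applied to the sorted contour tops, and `last_content_row(img)` would be appended instead of `img.shape[0]`.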

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.16299-SP0
Python:        3.9.1
NumPy:         1.20.1
OpenCV:        4.5.1
----------------------------------------
HansHirse