I'm trying to read a matrix from an image, but I'm having problems with numbers that have little or no spacing between them. I need to read the image line by line and turn it into an array, but first I need the numbers to be detected correctly.

Here is my code and output:

import cv2
import pytesseract

myString = ""

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

matris_image = cv2.imread('matris23.png')
matris_image = cv2.cvtColor(matris_image, cv2.COLOR_BGR2RGB)

matris = pytesseract.image_to_string(matris_image)

print(type(matris))
print(matris)

Output:

<class 'str'>
9135 2

1117 6

3 7 4 1

6 0 7 10


Process finished with exit code 0

I need output like this:

9 13 5 2
1 11 7 6
3 7 4 1
6 0 7 10

And here is the photo: Matrix test picture

Derek_P
Arda Yasar

  • Maybe as you know it’s a matrix (but Tesseract doesn’t) you could split it up into 16 little rectangles and OCR them individually? – DisappointedByUnaccountableMod Oct 12 '21 at 18:58
  • Hi balmy, I just did that and it looks better. But do you know of ways to increase the accuracy rate? – Arda Yasar Oct 12 '21 at 19:36
  • You only have to search for perhaps _tesseract improve recognition_ to find results like https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html and https://stackoverflow.com/questions/60624019/how-to-improve-ocr-with-pytesseract-text-recognition#60627069 - there are _many_ more. Binarizing to get rid of jpeg compression artifacts (or even better capturing using a lossless format like PNG) could help. – DisappointedByUnaccountableMod Oct 12 '21 at 21:10
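The "16 little rectangles" idea from the comments could be sketched roughly as below. This is only a sketch under assumptions: it assumes the matrix fills the image as an evenly spaced 4x4 grid, and the helper names (grid_boxes, ocr_matrix, read_cell_with_tesseract) are made up for illustration. The "--psm 7" option (treat the image as a single text line) and the digit whitelist are Tesseract settings that may or may not help on this particular picture.

```python
def grid_boxes(height, width, rows=4, cols=4):
    """Return (top, bottom, left, right) pixel bounds for each cell, row by row."""
    boxes = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        row = []
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            row.append((top, bottom, left, right))
        boxes.append(row)
    return boxes


def ocr_matrix(img, read_cell, rows=4, cols=4):
    """Crop every cell out of a NumPy-style image and pass it to read_cell."""
    height, width = img.shape[:2]
    return [
        [read_cell(img[top:bottom, left:right]) for (top, bottom, left, right) in row]
        for row in grid_boxes(height, width, rows, cols)
    ]


def read_cell_with_tesseract(cell):
    """Hypothetical helper: OCR one cell as a single line of digits."""
    import pytesseract  # imported lazily so grid_boxes works without it installed

    # Assumption: single-line mode plus a digit whitelist suits one small cell.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    text = pytesseract.image_to_string(cell, config=config).strip()
    return int(text) if text.isdigit() else None
```

With the question's image loaded as grayscale, for example img = cv2.imread('matris23.png', 0), calling ocr_matrix(img, read_cell_with_tesseract) would return a 4x4 nested list, with None wherever a cell could not be read as a number. If the picture has borders or uneven spacing, the cell bounds would need adjusting first.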

1 Answer

I wasn't able to get exactly the output you asked for, but I did get closer, and you may be able to get better results from here. I removed the channel-reversing call cv2.cvtColor(matris_image, cv2.COLOR_BGR2RGB) and edited the image by cropping the noise from the corners and desaturating it (making it black and white).

I had to add the blacklist for "." because pytesseract was returning "3.7" instead of "3 7".

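import cv2
import pytesseract

# Flag 0 loads the image as grayscale (same as cv2.IMREAD_GRAYSCALE)
img = cv2.imread('raw.png', 0)

# Blacklist "." so that "3 7" is not returned as "3.7"
matris = pytesseract.image_to_string(img, config="-c tessedit_char_blacklist=.")

print(type(matris))
print(matris)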

Gives:

<class 'str'>
9 13 5 2
1117 6
3 7 4 1
6 0 7 10
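
Since the question also asks for an array, here is a minimal sketch of turning the returned string into a nested list and flagging rows that still look misread (such as "1117 6" above). The function names parse_matrix and short_rows are my own, and the expected width of 4 columns is an assumption taken from the desired output in the question.

```python
def parse_matrix(text):
    """Turn OCR output into a list of integer rows, skipping blank or non-numeric lines."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if tokens and all(tok.isdigit() for tok in tokens):
            rows.append([int(tok) for tok in tokens])
    return rows


def short_rows(rows, cols=4):
    """Return the indices of rows that do not have exactly `cols` entries."""
    return [i for i, row in enumerate(rows) if len(row) != cols]


ocr_text = "9 13 5 2\n1117 6\n3 7 4 1\n6 0 7 10\n"
matrix = parse_matrix(ocr_text)
print(matrix)
print(short_rows(matrix))  # rows that would need another OCR pass
```

Any row index reported by short_rows is a row where digits were merged, so only those rows would need to be re-read, for example cell by cell.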
Derek_P