How to remove bad characters or special character in opencv python and improve OCR accuracy?

Question

I have built a program for extract text in image in python and OCR, but when i run the code I get some bad characters and its accuracy is not good , but it works. Can I add some datasetes about the characters that should be processed? How can I solve the problems?

This is my image :

And this is the code :

import cv2
import numpy as np
import pytesseract

# Read input image, convert to grayscale
img = cv2.imread('9.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Remove shadows, cf. https://stackoverflow.com/a/44752405/11089932
dilated_img = cv2.dilate(gray, np.ones((7, 7), np.uint8))
bg_img = cv2.medianBlur(dilated_img, 21)
diff_img = 255 - cv2.absdiff(gray, bg_img)
norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255,
                         norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)

# Threshold using Otsu's
work_img = cv2.threshold(norm_img, 0, 255, cv2.THRESH_OTSU)[1]

# Tesseract
custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(work_img, config=custom_config)
print(text)

And finally this is the output :

fe
|Urine Analysis
| Urine analysis
| Color Yellow RBC/hpf 4-6
| Appereance Turbid WBC/hpf 2-3
; Specific Gravity 1014 Epithelial cells/Lpf 1-2
PH 7 Bacteria (Few)
| Protein Pos(+) Casts Pos(+)
Glucose Negative Mucous (Few)
Keton. Negative
Blood Pos(+)
Bilirubin Negative
' Urobilinogen Negative
| Nitrite Pos(+)

Malik Hamza · Answer 1 · 2021-08-25T14:59:52.287

0

I had the similar web. I was trying to extract some information from the image but I was getting other raw text as well. So what you do is you can try an algorithm to extract only desired data.

Here is my image as input like yours Input image

Now this algorithm or code is extracting only IDs or Registration numbers of students.

Regs_No = list(new)
regs_no = []
count =0
Status = []
#Extracting Only Registration Number
for i in range(len(Regs_No)):
    if new[i][1:6] == "8MDSW":
       regs_no.append(new[i])
       Status.append('P')

So the above code is only extracting registration number.

In you case you can also use some code to get only desired text. Hope it works. Thanks.

edited Aug 25 '21 at 14:59

answered May 17 '21 at 12:35

Malik Hamza

181
1
10

Thanks a lot , Do you know how can I add a dataset to make better the accuracy of process? – FATEGH May 17 '21 at 13:19
1

Dataset won't help you. You should take more clear picture in order to get more accuracy. Picture should be centered and clear. In this way it can help you. – Malik Hamza May 17 '21 at 15:32
Got it. Thanks. – Malik Hamza Aug 25 '21 at 14:59

How to remove bad characters or special character in opencv python and improve OCR accuracy?

1 Answers1