I have code that draws a rectangle around a user's name in an image. I want to extract the text inside that rectangle, i.e. the user's name. The code is below:

import matplotlib.pyplot as plt
import cv2
import easyocr
from pylab import rcParams
from IPython.display import Image
rcParams['figure.figsize'] = 8, 16
reader = easyocr.Reader(['en'])
output = reader.readtext('MP-SAMPLE1.jpg')
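# corner points of the detected region that holds the name in my image
cord = output[-106][0]
# smallest and largest x and y over the corner points
x_min, y_min = [int(min(idx)) for idx in zip(*cord)]
x_max, y_max = [int(max(idx)) for idx in zip(*cord)]

image = cv2.imread('MP-SAMPLE1.jpg')
# draw a red rectangle (OpenCV uses BGR colour order), 2 px thick
cv2.rectangle(image,(x_min,y_min),(x_max,y_max),(0,0,255),2)
# convert BGR to RGB so matplotlib shows the correct colours
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))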

I set the coordinates to match my image; you can adjust them for yours. I need to extract the text highlighted by the rectangle. I am new to this field, so please excuse any mistakes I have made.

Nikhil Bansal

1 Answer

Here is a partial solution to the problem.

Since you are a beginner, one piece of advice: always start with pre-processing.

Pre-processing helps remove unwanted artifacts.

For instance, you can apply thresholding: Thresholding-result

or median filtering: Median-filter result

I used thresholding. After that, you can run the pytesseract library on the result; it offers many configuration options.
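
A minimal sketch of how the two pre-processing options and a pytesseract configuration string might be combined. The function names `build_tesseract_config` and `preprocess_and_read`, the median kernel size of 3, and the `--psm 6` page segmentation mode are assumptions chosen for illustration, not values tuned for this image.

```python
def build_tesseract_config(psm=6, oem=3):
    """Build a Tesseract config string such as '--oem 3 --psm 6'."""
    # --psm selects the page segmentation mode, --oem the OCR engine mode.
    return "--oem {} --psm {}".format(oem, psm)


def preprocess_and_read(image_path, use_median=False, psm=6):
    """Optionally median-filter, then Otsu-threshold an image and OCR it."""
    # Imported lazily so build_tesseract_config works without these packages.
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    if use_median:
        # Kernel size must be odd; 3 is an assumed starting point.
        gray = cv2.medianBlur(gray, 3)
    # With THRESH_OTSU the threshold is chosen automatically, so 0 is ignored.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # pytesseract accepts a NumPy array directly, so no temporary file is needed.
    return pytesseract.image_to_string(binary, lang="eng",
                                       config=build_tesseract_config(psm))
```

Calling `preprocess_and_read("FXSCh.jpg")` would return the recognised text as one string. Trying a few `psm` values is usually worthwhile, since the best mode depends on the page layout.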

For non-English languages, you can follow this tutorial.

You want the text next to FATHERS HUSBANDS, so we can do the following:

    1. Convert the image to text.

      • text = pytesseract.image_to_string(Image.open(f_name), lang='eng')
        
    2. In the text, find the line that contains FATHERS HUSBANDS.

      • for line in text.split('\n'):
            if "FATHERS HUSBANDS" in line:
                name = line.split('.')[1].split(',')[0]
                print(name)
        
      • Result:

        • GRAMONAN GROVER
          

The last name is correct, but the first name is only partially correct: it should be BRAJMONAN.
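
The lookup above relies on the line containing a `.` after the label and a `,` after the name, so it raises an `IndexError` when the `.` is missing. A slightly more forgiving version could look like the sketch below. The helper name `extract_field` and the sample text are made up for illustration; the real OCR output for your document will differ, and you may need to pass a longer label.

```python
import re


def extract_field(text, label):
    """Return the value that follows `label` on the first matching line, or None."""
    for line in text.splitlines():
        # Case-insensitive search for the label anywhere in the line.
        match = re.search(re.escape(label), line, flags=re.IGNORECASE)
        if match is None:
            continue
        rest = line[match.end():]
        # Drop separators that tend to follow the label, such as ". " or ": ".
        rest = re.sub(r"^[\s.:'\-]+", "", rest)
        # Keep only the part before the first comma, as in the loop above.
        value = rest.split(",")[0].strip()
        return value or None
    return None


# Invented sample text, only to show the call; not real OCR output.
sample = "ROLL NO 123\nFATHERS HUSBANDS NAME: JOHN DOE, CLASS X\nSTUDENT NAME JANE ROE"
print(extract_field(sample, "FATHERS HUSBANDS NAME"))
print(extract_field(sample, "student name"))
```

Returning `None` when the label is absent makes it easier to run the same function over several documents and see which fields were not found.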

I wrote this answer hoping to guide you toward a solution. Good luck.

Code:


import os
import cv2
import pytesseract

from PIL import Image

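# load the image and convert it to grayscale
img = cv2.imread("FXSCh.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu threshold: the threshold value is chosen automatically
gry = cv2.threshold(gry, 0, 255,
                    cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# write the thresholded image to a temporary file for pytesseract
f_name = "{}.png".format(os.getpid())
cv2.imwrite(f_name, gry)

text = pytesseract.image_to_string(Image.open(f_name), lang='eng')

# remove the temporary file before parsing, so it is cleaned up even if parsing fails
os.remove(f_name)

for line in text.split('\n'):
    if "FATHERS HUSBANDS" in line:
        # the name sits between the first '.' and the following ','
        name = line.split('.')[1].split(',')[0]
        print(name)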

cv2.imshow("Image", img)
cv2.imshow("Output", gry)
cv2.waitKey(0)
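
If you would rather stay with easyocr from the question, note that each entry returned by `readtext` is a triple of corner points, text, and confidence, so the recognised text is at index 1 of the entry whose box you are drawing. The sketch below looks a field up by its label instead of by a fixed index such as `-106`. The helper names are hypothetical, and it assumes the value is the detection that comes right after the label in easyocr's output order, which may not hold for every layout.

```python
def bounding_box(corners):
    """Convert four (x, y) corner points into (x_min, y_min, x_max, y_max)."""
    xs = [int(p[0]) for p in corners]
    ys = [int(p[1]) for p in corners]
    return min(xs), min(ys), max(xs), max(ys)


def find_text_after_label(results, label):
    """Return (text, box) of the detection right after the one containing `label`.

    Assumption: the value follows its label in the result order.
    Returns None if the label is not found or is the last detection.
    """
    for i, (_corners, text, _conf) in enumerate(results[:-1]):
        if label.upper() in text.upper():
            next_corners, next_text, _ = results[i + 1]
            return next_text, bounding_box(next_corners)
    return None


def read_field(image_path, label):
    """Run easyocr on an image and look up the text that follows `label`."""
    import easyocr  # lazy import: only needed when reading a real image
    reader = easyocr.Reader(['en'])
    results = reader.readtext(image_path)  # list of (corners, text, confidence)
    return find_text_after_label(results, label)
```

For example, `read_field('MP-SAMPLE1.jpg', 'FATHERS')` would return the text and box of the detection after the first one containing "FATHERS", or `None` if no such detection exists.
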
Ahmet
  • Hello, thank you for helping me out. The solution works well for this specific case, but I want to extract multiple details of a student, and from multiple documents. When I fed this solution a mark sheet of another student in a similar format, it gave me -1 as output. When I tried to extract the student's name the same way you extracted the father's name, it also gave me -1. – Nikhil Bansal Nov 12 '20 at 08:59
  • any idea if there is a reliable way to detect text that has been highlighted physically in printed documents and digitally in PDFs? – oldboy Oct 24 '21 at 21:20
  • @oldboy I'm not sure, but I think you may use [inRange thresholding](https://docs.opencv.org/4.5.4/da/d97/tutorial_threshold_inRange.html). – Ahmet Oct 26 '21 at 18:01
  • awesome ill check it out. thanks – oldboy Oct 27 '21 at 19:20