How to extract text from the highlighted text from an image

Question

I have code that highlights a user's name in an image, and I want to extract that text (the user's name) from the image. The code is below:

import matplotlib.pyplot as plt
import cv2
import easyocr
from pylab import rcParams
from IPython.display import Image
rcParams['figure.figsize'] = 8, 16
reader = easyocr.Reader(['en'])
output = reader.readtext('MP-SAMPLE1.jpg')
cord = output[-106][0]
x_min, y_min = [int(min(idx)) for idx in zip(*cord)]
x_max, y_max = [int(max(idx)) for idx in zip(*cord)]

image = cv2.imread('MP-SAMPLE1.jpg')
cv2.rectangle(image,(x_min,y_min),(x_max,y_max),(0,0,255),2)
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

I set the coordinates for my image; you can adjust them for yours. I need to extract the text inside the highlighted rectangle. I am new to this field, so please excuse any mistakes.

If possible, could you please share `MP-SAMPLE1.jpg`, or whatever image you are working with? — Ahmet, Nov 12 '20 at 07:36
I've looked at your problem but couldn't solve it. The best result I get is `GRAMONAN GROVER`, most probably because I don't have the Indian pytesseract data. — Ahmet, Nov 12 '20 at 08:16
Were you able to extract it as text? If yes, can you please share the code? — Nikhil Bansal, Nov 12 '20 at 08:18

score 0 · Accepted Answer · answered Nov 12 '20 at 08:32

Here is my partial solution to the problem.

Since you are a beginner, let me give you a piece of advice: always start with pre-processing.

Pre-processing helps remove unwanted artifacts.

For instance, you can apply thresholding: (image: thresholding result)

or median filtering: (image: median-filter result)
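
To get a feel for what Otsu thresholding does, here is a small pure-Python sketch (no OpenCV needed). It is only an illustration under my own assumptions: the function names `otsu_threshold` and `binarize` and the tiny sample image are made up, and in practice you would let `cv2.threshold` with the `cv2.THRESH_OTSU` flag do this for you.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark from bright pixels.

    Illustrative helper (hypothetical name): it tries every threshold and keeps
    the one with the largest between-class variance, which is Otsu's criterion.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_bright = total - w_dark
        if w_dark == 0 or w_bright == 0:
            continue  # one class is empty, nothing to separate
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        var_between = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(rows, t):
    """Map pixels above t to 255 and the rest to 0."""
    return [[255 if p > t else 0 for p in row] for row in rows]


# Made-up 3x4 grayscale "image": dark text (about 10) on a bright page (about 200)
image = [
    [200, 210, 12, 205],
    [10, 11, 198, 202],
    [207, 9, 13, 199],
]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
for row in binary:
    print(row)
```

The chosen threshold lands between the dark and the bright group, so the text pixels become 0 and the page becomes 255, which is the kind of clean input an OCR engine prefers.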

I used thresholding; after that you can run the pytesseract library on the result. The library has a lot of configuration options.

Also, for non-English languages, you can follow this tutorial.

So, you want the text next to FATHERS HUSBANDS. Therefore we could do the following:

Convert the image to text.

text = pytesseract.image_to_string(Image.open(f_name), lang='eng')

From the text, find the line containing FATHERS HUSBANDS and take the name that follows it.

for line in text.split('\n'):
    if "FATHERS HUSBANDS" in line:
        name = line.split('.')[1].split(',')[0]
        print(name)

Result:
```
GRAMONAN GROVER
```

The last name is correct, but the first name is only partially correct: it should be BRAJMONAN.
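
The `split('.')` lookup above raises an error when the matching line has no '.', and it only handles one label. A more forgiving variant could look like the rough sketch below. The helper name `find_field` and the sample OCR text are my own assumptions about the layout, not real output from the image.

```python
import re


def find_field(text, label):
    """Return the value that follows `label` on its line, or None.

    Hypothetical helper: it expects "<label> ... <'.' or ':'> value, ..." on
    one line, which is an assumption about how the OCR text is laid out.
    """
    # Let the words of the label be separated by any non-word characters,
    # since OCR often changes spacing and punctuation.
    label_pattern = r"[^\w\n]+".join(re.escape(word) for word in label.split())
    # Skip to the first '.' or ':' after the label, then capture up to ',' or end of line.
    pattern = label_pattern + r"[^\n.:]*[.:][ \t]*([^,\n]+)"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return None
    value = match.group(1).strip()
    return value or None


# Made-up OCR text for illustration only
sample = (
    "NAME OF STUDENT. EXAMPLE STUDENT, ROLL NO 1\n"
    "FATHERS  HUSBANDS NAME. GRAMONAN GROVER, CITY\n"
)

labels = ("NAME OF STUDENT", "FATHERS HUSBANDS", "MOTHERS")
fields = {label: find_field(sample, label) for label in labels}
print(fields)
```

A missing label comes back as `None` instead of an exception, so the same loop can be run over several labels and several documents and the gaps checked afterwards.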

I wrote this answer hoping to guide you toward your solution. Good luck.

Code:

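import os
import cv2
import pytesseract

from PIL import Image

# Load the image and convert it to grayscale
img = cv2.imread("FXSCh.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Binarize with Otsu's method, which picks the threshold automatically
gry = cv2.threshold(gry, 0, 255,
                    cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# Save the pre-processed image to a temporary file for pytesseract
f_name = "{}.png".format(os.getpid())
cv2.imwrite(f_name, gry)

# Run OCR on the pre-processed image
text = pytesseract.image_to_string(Image.open(f_name), lang='eng')

# Find the line with the label, take the segment after the first '.',
# then cut it at the first ','
for line in text.split('\n'):
    if "FATHERS HUSBANDS" in line:
        name = line.split('.')[1].split(',')[0]
        print(name)

# Clean up the temporary file
os.remove(f_name)

# Show the original and the thresholded image
cv2.imshow("Image", img)
cv2.imshow("Output", gry)
cv2.waitKey(0)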
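
Since the question already runs EasyOCR: with the default settings, each entry that `reader.readtext` returns is a `(box, text, confidence)` triple, so the recognized text sits right next to the coordinates used for the rectangle. A minimal sketch, where `box_and_text` is a hypothetical helper and the `output` list is a hand-written stand-in for real EasyOCR results:

```python
def box_and_text(detection):
    """Split one EasyOCR-style detection into ((x_min, y_min, x_max, y_max), text).

    `detection` is assumed to be (corners, text, confidence), where corners
    is a list of four [x, y] points.
    """
    corners, text = detection[0], detection[1]
    xs, ys = zip(*corners)
    box = (int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys)))
    return box, text


# Hand-written stand-in for `reader.readtext('MP-SAMPLE1.jpg')`; the values are made up
output = [
    ([[10, 20], [110, 20], [110, 45], [10, 45]], "NAME", 0.98),
    ([[120, 18.5], [300, 22], [300, 48], [120, 44.2]], "GRAMONAN GROVER", 0.71),
]

box, text = box_and_text(output[-1])
print(box, text)
```

With the real `output` from the question, the same index that selects the rectangle (there `output[-106]`) also selects the text to print.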

Hello, thank you for helping me out. The solution works well for this specific case, but I want to extract multiple details of a student, and from multiple documents. When I fed this solution a mark sheet in a similar format for another student, it gave me -1 as output, and when I tried to extract the student's name the same way you extracted the father's name, it also gave me -1. — Nikhil Bansal, Nov 12 '20 at 08:59
Any idea if there is a reliable way to detect text that has been highlighted physically in printed documents and digitally in PDFs? — oldboy, Oct 24 '21 at 21:20
@oldboy I'm not sure, but I think you may use [inRange thresholding](https://docs.opencv.org/4.5.4/da/d97/tutorial_threshold_inRange.html). — Ahmet, Oct 26 '21 at 18:01
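
The inRange idea can be sketched without OpenCV to show the principle: mark every pixel whose hue, saturation and value fall inside a chosen range, then take the bounding box of the marked pixels. The yellow bounds and the names `highlight_mask` and `mask_bbox` are assumptions for illustration; real highlighter colours need tuning, and with OpenCV you would convert the image to HSV and call `cv2.inRange` instead.

```python
import colorsys


def highlight_mask(rows, h_range=(0.11, 0.20), s_min=0.3, v_min=0.5):
    """Return a 0/1 mask marking pixels that look like yellow highlighter.

    rows is a list of rows of (r, g, b) tuples with values 0-255.
    The default bounds are guesses for yellow and will need tuning.
    """
    mask = []
    for row in rows:
        mask_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            inside = h_range[0] <= h <= h_range[1] and s >= s_min and v >= v_min
            mask_row.append(1 if inside else 0)
        mask.append(mask_row)
    return mask


def mask_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the marked pixels, or None."""
    points = [(x, y) for y, row in enumerate(mask) for x, m in enumerate(row) if m]
    if not points:
        return None
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


# Made-up 5x4 page: white paper, black text, pale yellow highlight
WHITE, BLACK, YELLOW = (255, 255, 255), (0, 0, 0), (255, 255, 128)
page = [
    [WHITE, WHITE, WHITE, WHITE, WHITE],
    [WHITE, YELLOW, BLACK, YELLOW, WHITE],
    [WHITE, YELLOW, YELLOW, YELLOW, WHITE],
    [WHITE, WHITE, WHITE, WHITE, WHITE],
]
mask = highlight_mask(page)
bbox = mask_bbox(mask)
print(bbox)
```

The bounding box can then be used to crop the region before running OCR on it. Text pixels inside the highlight are not part of the mask, which is why the box, not the mask itself, is the useful result here.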
