1

Hi, I am extracting some text from a resume using coordinates. After extracting the text with Pytesseract OCR, an arrow appears each time I write the text to a txt file.

This is my code:

import cv2
import numpy as np
import pytesseract
import threading

image = cv2.imread(r'C:\Users\Ramesh\Desktop\Parsing_Project\Resumes_jpg\Akhil\Akhil.jpg')
image = cv2.resize(image,(800,740))

kernel = np.array([[-1,-1,-1], 
                   [-1, 9,-1],
                   [-1,-1,-1]])

sharpened = cv2.filter2D(image, -1, kernel)

f = open(r'C:\Users\Ramesh\Desktop\Parsing_Project\result_text.txt', "a")

def designation(image):
    
    designation_cropped = image[65:90, 290:600]
    text = pytesseract.image_to_string(designation_cropped).replace(',', ' ')
    print(text)
    f.write(text + '\n' )

def skills(image):
    skills_cropped = sharpened[110:210, 10:220]
    text = pytesseract.image_to_string(skills_cropped).replace(',', ' ')
    print(text)
    f.write(text + '\n' )
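
# Pass each function and its argument separately; target=designation(image)
# would call the function immediately in the main thread.
threads = [threading.Thread(target=designation, args=(image,)),
           threading.Thread(target=skills, args=(image,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
f.close()  # close the file only after both threads have finished writing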

This is a snippet of the extracted text. Notice that an arrow appears after each piece of text I write to the txt file.

This is the result file

I want to get rid of the arrow sign. Could someone help me out?

  • Does this help: https://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for? – Pygirl Jan 25 '21 at 06:57
  • I am actually trying to remove the arrow sign captured in my text (as shown in the screenshot attached to the question), which was not in the image I extracted the text from. – Rahul Ramesh Jan 25 '21 at 06:57
  • Sorry, I didn't read the question properly. You need to limit the characters for pytesseract. – Pygirl Jan 25 '21 at 06:58
  • I got the solution @Pygirl by using text.replace('\f','') – Rahul Ramesh Jan 25 '21 at 07:04

2 Answers

1

Hi all, I found the solution.

I used:

text.replace('\f','')

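This removed the arrow that was being captured in my result. Note that replace returns a new string, so the result has to be assigned back to text (or written out directly).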
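For anyone hitting the same thing, here is a minimal sketch of why this works. The arrow is how some editors display the form feed character ('\f', which Python prints as '\x0c'). The helper name strip_form_feed and the sample string are made up for illustration; the sample only imitates what image_to_string might return.

```python
def strip_form_feed(text):
    """Return text with every form feed character removed."""
    # str.replace does not modify the string in place; it returns a new one.
    return text.replace('\f', '')


# Hypothetical OCR output: real text followed by a trailing form feed.
sample = "Software Engineer\n\f"

print(repr(sample))                   # 'Software Engineer\n\x0c'
print(repr(strip_form_feed(sample)))  # 'Software Engineer\n'
```

Printing with repr is a handy way to check for invisible characters before writing the text to a file.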

0

That arrow represents the page separator of the output text.

You can set the page separator to an empty string in Tesseract with the following configuration:

 -c page_separator=""

In your case:

text = pytesseract.image_to_string(designation_cropped, config='-c page_separator=""').replace(',', ' ')

With this setting, the returned text will not contain a page separator.
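Since both functions in the question crop a region and then run OCR on it, the same config can be applied in one place. This is only a sketch: ocr_region, its rows/cols parameters, and the optional ocr argument are hypothetical names introduced here, not part of pytesseract. The ocr argument exists only so the helper can be tried out without Tesseract installed; by default it falls back to pytesseract.image_to_string.

```python
# Config string from the answer above: disables Tesseract's page separator.
PAGE_SEPARATOR_OFF = '-c page_separator=""'


def ocr_region(image, rows, cols, ocr=None):
    """OCR the region image[rows[0]:rows[1], cols[0]:cols[1]] without a page separator."""
    if ocr is None:
        # Imported lazily so this sketch can be loaded without pytesseract.
        import pytesseract
        ocr = pytesseract.image_to_string
    cropped = image[rows[0]:rows[1], cols[0]:cols[1]]
    # Same comma handling as in the question's code.
    return ocr(cropped, config=PAGE_SEPARATOR_OFF).replace(',', ' ')
```

With the coordinates from the question, the designation text would then be read with ocr_region(image, (65, 90), (290, 600)) and the skills text with ocr_region(sharpened, (110, 210), (10, 220)).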

skaveesh