Use pytesseract to read content that is out of position

Question

I am using pytesseract to read the content related to date/time. This works well when the content to be read is on the same line. However, in the following case, I am not even able to use OpenCV to identify the area containing information:

image

Can anyone help me find a way to read information laid out like in this image? Thanks, everyone.

Right. However, I am not allowed to use the PDF reader tools available on python — Pain, Sep 10 '19 at 03:21
And you are not getting the data in a fixed structured format. right? — Niaz Palak, Sep 10 '19 at 03:22
Right. At the moment, I just have a problem with the structure as shown in my image — Pain, Sep 10 '19 at 03:25
As in my description above, I'm trying to use OpenCV and Tesseract to do this — Pain, Sep 10 '19 at 06:25

score 0 · Answer 1 · answered Sep 10 '19 at 03:33

PDF is an unstructured format, so you cannot get structured data from it directly. There are some tricks for extracting information, though. If your data is italic or bold, you can follow this link to extract it. If your date/time values are not formatted like that, you can use a regex to extract them. If you have multiple date/time values and want to store them for different purposes, you can use the words before and after each one to tell them apart. The point remains that you cannot get the position in a PDF.
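The regex idea could be sketched roughly as follows. This assumes the text has already been extracted into a plain string (for example with `pytesseract.image_to_string`). The date/time formats in the pattern, the sample text, and the helper name `find_date_times` are assumptions for illustration, so adjust them to whatever your documents actually contain.

```python
import re

# Assumed formats: dates like 10/09/2019 or 2019-09-10,
# optionally followed by a time like 03:21 or 03:21:05.
DATE_TIME = re.compile(
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{4}|\d{4}-\d{2}-\d{2})"
    r"(?:\s+(\d{1,2}:\d{2}(?::\d{2})?))?"
)


def find_date_times(text, context_words=2):
    """Return each date/time match with the words just before and after it."""
    results = []
    for match in DATE_TIME.finditer(text):
        # Neighbouring words help tell several date/time values apart.
        before = text[:match.start()].split()[-context_words:]
        after = text[match.end():].split()[:context_words]
        results.append({
            "date": match.group(1),
            "time": match.group(2),  # None when no time follows the date
            "before": " ".join(before),
            "after": " ".join(after),
        })
    return results


# Hypothetical extracted text, standing in for real OCR output.
sample = "Issued on 10/09/2019 03:21 by office\nValid until 2019-12-31 only"
found = find_date_times(sample)
for item in found:
    print(item)
```

The `before` and `after` fields are what let you decide which date is which, for example by checking whether `before` contains a label such as "Issued" or "Valid".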

Thanks for your answer. But as in my description, I'm trying to use OpenCV and Tesseract. Is there any way to do this? — Pain, Sep 10 '19 at 06:53
