
I am using pytesseract to read date/time content from images. This works well when the content is on a single line. However, in the following case, I cannot even use OpenCV to identify the area containing the information:

image

Can anyone help me find a way to read information laid out like in this image? Thanks, everyone.

Niaz Palak
Pain

1 Answer


PDF is an unstructured format, so you cannot get structured data from a PDF directly. However, there are some tricks for extracting information from one: if your data is italic or bold, you can follow this link to extract it. If your date/time data is not formatted that way, you can use a regex to extract it. If you have multiple date/time values and want to store them for different purposes, you can use the words before and after each one to tell them apart. The point is that you cannot get the position of text in a PDF.
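As a rough illustration of the regex idea, here is a minimal sketch that runs on already-extracted text (for example, a string returned by OCR). The pattern, the `extract_datetimes` name, the `context_words` parameter, and the sample string are all assumptions made for this example; you would need to adapt the pattern to the date/time formats your documents actually use.

```python
import re

# Hypothetical pattern: an ISO date (2019-09-10) or a numeric date with
# "/" or "-" separators (10/09/2019), optionally followed by a time
# (06:53 or 06:53:10). "\s+" also matches a line break, so a time on the
# next line is still paired with the date before it.
DATETIME_RE = re.compile(
    r"\b(?P<date>\d{4}-\d{2}-\d{2}|\d{1,2}[/-]\d{1,2}[/-]\d{2,4})"
    r"(?:\s+(?P<time>\d{1,2}:\d{2}(?::\d{2})?))?"
)


def extract_datetimes(text, context_words=2):
    """Return (label, date, time) tuples for every date found in text.

    label is made of the last `context_words` words before the match,
    which can be used to tell several date/time values apart.
    """
    results = []
    for match in DATETIME_RE.finditer(text):
        words_before = text[:match.start()].split()[-context_words:]
        label = " ".join(words_before)
        results.append((label, match.group("date"), match.group("time")))
    return results


# Made-up sample text, standing in for extracted content.
sample = "Start time: 10/09/2019 06:53\nEnd time: 11/09/2019"
for label, date, time in extract_datetimes(sample):
    print(label, date, time)
```

Using the words before each match as a label is only one way to differentiate the values; looking at the words after the match would work the same way with `text[match.end():]`.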

Niaz Palak
Thanks for your answer. But as I described, I'm trying to use OpenCV and Tesseract. Is there any way to do this? – Pain Sep 10 '19 at 06:53