
I have a project where I need to highlight text in a structured PDF document and classify each highlighted part, so that I can run regexes on multiple substrings and assign the matches to the right variables. Is there a way to display a PDF on screen, let the user highlight multiple parts, and automatically map each highlighted part to a field that I can then use to build regular expressions? I would like to avoid first extracting the text from the PDF and then manually writing regexes for every substring of interest.

Right now I'm using the pdfplumber library in Python to extract the text of each PDF line by line and append it to a string, which I then run regexes on.
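
For reference, a minimal sketch of that extract-then-regex workflow. The helper names, field names, and patterns are hypothetical and only for illustration; `pdfplumber.open` and `page.extract_text()` are the real pdfplumber calls, imported lazily so the regex part runs without the library installed.

```python
import re


def extract_pdf_text(path):
    """Join the text of every page into one string (requires pdfplumber)."""
    import pdfplumber  # lazy import: the regex helper below does not need it

    with pdfplumber.open(path) as pdf:
        # extract_text() may yield nothing for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_fields(text, patterns):
    """Apply one regex per field; keep the first capture group, or None."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields


# Hypothetical field names and patterns, not taken from a real document
PATTERNS = {
    "invoice_number": r"Invoice No\.\s*(\d+)",
    "total": r"Total:\s*([\d.]+)",
}

sample = "Invoice No. 1042\nDate: 2022-01-22\nTotal: 99.50"
print(extract_fields(sample, PATTERNS))
```

With a real file, `extract_fields(extract_pdf_text("some.pdf"), PATTERNS)` would replace the sample string. Every pattern still has to be written by hand, which is the part I want to avoid.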

I would like to be able to highlight multiple lines of text in the PDF file, classify each of them individually, and automatically pass them as arguments to whichever regular expression library I'm using, getting back either one regular expression or several.
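
To make the goal concrete, here is a rough sketch of the mapping step, under some assumptions: the words are dicts shaped like the output of pdfplumber's `page.extract_words()` (keys `text`, `x0`, `x1`, `top`, `bottom`), and each highlight is a labelled rectangle `(x0, top, x1, bottom)` in the same page coordinates. Where the rectangles come from (a viewer widget, or highlight annotations stored in the PDF) is left open, and the function names and sample coordinates are made up.

```python
def text_in_box(words, box):
    """Return the words whose centre lies inside box, in reading order."""
    x0, top, x1, bottom = box
    inside = [
        w for w in words
        if x0 <= (w["x0"] + w["x1"]) / 2 <= x1
        and top <= (w["top"] + w["bottom"]) / 2 <= bottom
    ]
    # Sort top-to-bottom, then left-to-right; rounding is a crude way to
    # group words on the same line and may need a tolerance in practice
    inside.sort(key=lambda w: (round(w["top"]), w["x0"]))
    return " ".join(w["text"] for w in inside)


def classify_highlights(words, highlights):
    """Map each field label to the text covered by its highlight rectangle."""
    return {label: text_in_box(words, box) for label, box in highlights.items()}


# Made-up words, shaped like pdfplumber's extract_words() output
words = [
    {"text": "Invoice", "x0": 50, "x1": 90, "top": 100, "bottom": 112},
    {"text": "No.", "x0": 94, "x1": 112, "top": 100, "bottom": 112},
    {"text": "1042", "x0": 116, "x1": 140, "top": 100, "bottom": 112},
    {"text": "Total:", "x0": 50, "x1": 84, "top": 130, "bottom": 142},
    {"text": "99.50", "x0": 88, "x1": 118, "top": 130, "bottom": 142},
]

# Made-up highlight rectangles, one per field the user classified
highlights = {
    "invoice_number": (114, 98, 142, 114),
    "total": (86, 128, 120, 144),
}

print(classify_highlights(words, highlights))
```

The resulting label-to-text dictionary is what I would then like to feed into a regex step automatically.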

PeterQuando

1 Answer


Highlight text in a PDF with Python

These might help: https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052

https://www.thepythoncode.com/article/redact-and-highlight-text-in-pdf-with-python

For the GUI you could use GTK: https://python-gtk-3-tutorial.readthedocs.io/en/latest/textview.html

  • I know these tools, but they don't solve my problem. I need to automatically map highlighted text in a PDF file to its string output. – Maurice Bekambo Jan 22 '22 at 16:42