Mapping highlighted text in a PDF document to a character index range in its .txt output

Question

I have a project where I need to highlight text in a structured PDF document and classify each highlight, so that I can run regular expressions on multiple substrings and assign the matched values to the corresponding variables. Is there a way to display a PDF on screen, let the user highlight multiple parts, and automatically assign each highlighted part to a field that I can then use to build regular expressions? I want to avoid first extracting the text from the PDF and then manually writing regexes for every substring of interest.

Right now I'm using the pdfplumber library in Python to extract the text of each PDF line by line and append it to a string, so that I can run regexes on it.
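For illustration, the line-by-line approach can also record where each line lands in the combined string, which is the bookkeeping any later mapping to character indices would need. This is a minimal sketch, not the asker's actual code: `join_with_offsets` and the sample lines are hypothetical, and with pdfplumber the lines would presumably come from something like `page.extract_text().splitlines()` for each page.

```python
def join_with_offsets(lines):
    """Join lines with '\n' and record each line's (start, end) range in the result."""
    parts = []
    offsets = []
    pos = 0
    for line in lines:
        offsets.append((pos, pos + len(line)))
        parts.append(line)
        pos += len(line) + 1  # +1 accounts for the newline separator
    return "\n".join(parts), offsets


# Hypothetical sample lines standing in for extracted page text.
sample_lines = ["Invoice No: 12345", "Date: 2022-01-22", "Total: 99.50 EUR"]
text, offsets = join_with_offsets(sample_lines)

start, end = offsets[1]
print(text[start:end])  # the second line, recovered by its character range
```

Keeping the offsets next to the string means a line number can always be translated into a character range without searching the text again.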

I would like to simply highlight multiple pieces of text in the PDF file, classify each one individually, and have them passed automatically as arguments to whichever regular expression library I'm using, getting back one or more regular expressions.
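One way to picture the workflow described here, assuming the highlighted strings and their labels have already been collected somehow: turn each labelled snippet into a whitespace-tolerant pattern and look it up in the extracted text. The sketch below is only that, a sketch; `snippet_pattern`, `locate_fields`, the labels, and the sample text are all hypothetical, and it reports only the first match of each snippet.

```python
import re


def snippet_pattern(snippet):
    """Build a regex matching the snippet's words with any whitespace between them."""
    words = snippet.split()
    return re.compile(r"\s+".join(re.escape(w) for w in words))


def locate_fields(text, labelled_snippets):
    """Map each label to the (start, end) range of its first match, or None."""
    spans = {}
    for label, snippet in labelled_snippets.items():
        # An empty or whitespace-only snippet cannot be located meaningfully.
        match = snippet_pattern(snippet).search(text) if snippet.split() else None
        spans[label] = match.span() if match else None
    return spans


# Hypothetical extracted text; note the line break inside "99.50 EUR".
text = "Invoice No: 12345\nDate: 2022-01-22\nTotal: 99.50\nEUR"
highlights = {"invoice_number": "12345", "total": "99.50 EUR"}

spans = locate_fields(text, highlights)
for label, span in spans.items():
    print(label, span, repr(text[slice(*span)]) if span else None)
```

Matching on `\s+` rather than a literal space is the design choice that matters: extracted text often breaks lines where the PDF viewer shows a single space. If the same snippet occurs more than once in the document, a text-only lookup like this cannot tell which occurrence was highlighted.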

BillaBong Jr. · Answer 1 · 2022-01-22T15:42:32.490

-1

Highlight text in a PDF with Python

These might help: https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052

https://www.thepythoncode.com/article/redact-and-highlight-text-in-pdf-with-python

For the GUI you could use GTK: https://python-gtk-3-tutorial.readthedocs.io/en/latest/textview.html

edited Jan 22 '22 at 15:42

answered Jan 22 '22 at 15:25


I know these tools, but they don't solve my problem. I need to automatically map highlighted text in a PDF file to its string output. – Maurice Bekambo Jan 22 '22 at 16:42
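The mapping asked for here could, in principle, be done geometrically: if each character's bounding box is known, a highlight rectangle selects the characters whose centres fall inside it, and their positions give a character range. The sketch below assumes character dicts with `text`, `x0`, `x1`, `top`, and `bottom` keys, the shape pdfplumber uses for `page.chars`, and assumes the highlight rectangle is already in the same coordinate system. `highlight_to_range` and the sample data are hypothetical, and the range refers to the concatenation of the characters' text, which is not necessarily identical to what `extract_text()` returns.

```python
def highlight_to_range(chars, rect):
    """Map rect = (x0, top, x1, bottom) to a (start, end) character range.

    The range indexes into ''.join(c['text'] for c in chars). Returns None
    when no character centre lies inside the rectangle.
    """
    x0, top, x1, bottom = rect
    start = end = None
    pos = 0
    for ch in chars:
        next_pos = pos + len(ch["text"])  # a glyph may carry more than one character
        cx = (ch["x0"] + ch["x1"]) / 2
        cy = (ch["top"] + ch["bottom"]) / 2
        if x0 <= cx <= x1 and top <= cy <= bottom:
            if start is None:
                start = pos
            end = next_pos
        pos = next_pos
    return None if start is None else (start, end)


# Hypothetical single line of characters, each 6 units wide.
line = "Total: 99.50"
chars = [
    {"text": c, "x0": 6 * i, "x1": 6 * (i + 1), "top": 100, "bottom": 112}
    for i, c in enumerate(line)
]
full_text = "".join(c["text"] for c in chars)

# Hypothetical highlight rectangle drawn over "99.50".
char_range = highlight_to_range(chars, (42, 98, 72, 114))
print(char_range, full_text[slice(*char_range)])
```

This assumes the characters are in reading order and that one rectangle covers one line; a highlight spanning several lines would typically be several rectangles, each mapped separately.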
