I have a project where I need to highlight text in a structured PDF document and classify each highlighted part, so that I can run regular expressions on multiple substrings and assign the matches to the right variables. Is there a way to display a PDF on screen, let the user highlight multiple parts, and have each highlight automatically assigned to a field that I can then use to build regular expressions, without first extracting the text from the PDF and manually writing a regex for every substring of interest?
Right now I'm using the pdfplumber library in Python to extract the text of each PDF line by line and append it to a string, which I then run my regexes against.
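For reference, a minimal sketch of this extract-then-regex workflow might look like the following. The field names and patterns in `FIELD_PATTERNS` are hypothetical placeholders, not taken from my real documents, and `extract_lines` assumes the PDF has a text layer that pdfplumber can read.

```python
import re


def extract_lines(pdf_path):
    """Collect the text lines of every page using pdfplumber (third-party)."""
    import pdfplumber  # imported here so match_fields() works without it

    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Guard against pages where no text could be extracted.
            text = page.extract_text() or ""
            lines.extend(text.splitlines())
    return lines


# Hypothetical fields and patterns, written by hand for each substring of interest.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s+No\.?\s*:\s*(\S+)"),
    "total": re.compile(r"Total\s*:\s*([\d.,]+)"),
}


def match_fields(lines, patterns=FIELD_PATTERNS):
    """Join the lines into one string and return the first match per field."""
    text = "\n".join(lines)
    result = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result
```

Usage would be something like `match_fields(extract_lines("document.pdf"))`, where `document.pdf` is a placeholder path. The manual step I want to avoid is writing each entry of `FIELD_PATTERNS` by hand.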
Ideally, I would like to highlight multiple lines of text in the PDF file, classify each one individually, and have them passed automatically as arguments to whichever regular expression library I'm using, getting back either one regular expression per highlight or a single combined regular expression.
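To make the desired output concrete, here is a rough sketch of the last step only, assuming the highlighted snippets and their labels have already been captured somehow (that capture step is the part I don't know how to do). The helper names `snippet_to_pattern` and `build_patterns` are made up for illustration, and the generalization rule (digit runs become `\d+`, whitespace becomes `\s+`, everything else is matched literally) is just one simple assumption about how a regex could be derived from an example.

```python
import re

# Split a snippet into digit runs, whitespace runs, and everything else.
TOKEN = re.compile(r"(?P<digits>\d+)|(?P<space>\s+)|(?P<literal>[^\d\s]+)")


def snippet_to_pattern(label, snippet):
    """Derive a regex with a named group from one highlighted example.

    `label` must be a valid Python identifier because it becomes the
    name of the capture group.
    """
    parts = []
    for match in TOKEN.finditer(snippet):
        if match.lastgroup == "digits":
            parts.append(r"\d+")  # any run of digits
        elif match.lastgroup == "space":
            parts.append(r"\s+")  # any run of whitespace
        else:
            parts.append(re.escape(match.group()))  # literal text
    return re.compile("(?P<%s>%s)" % (label, "".join(parts)))


def build_patterns(labelled_snippets):
    """Map each label to the pattern derived from its highlighted text."""
    return {
        label: snippet_to_pattern(label, snippet)
        for label, snippet in labelled_snippets.items()
    }
```

For example, `build_patterns({"invoice_number": "INV-2023 0042"})` would give a pattern that also matches `INV-2024 0099` in another document with the same layout.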