I have been extracting text from PDFs using pdftotext. I have also done this with Ghostscript. Recently, a utility provider changed their PDFs so a portion of it is not being extracted by these methods. Specifically, I'm missing the due date and total due. When I open the PDF in a reader, the 'missing' text can be highlighted, copied, and pasted into an external editor. When I open it in Acrobat Pro, and view the content (View -> Show/Hide -> Navigation Panes -> Content), the text I need is there. How can I get it out without manually copying and pasting? (which is not an option, because I'll be doing this on thousands of PDFs)?
Here an example of what I'm dealing with. I have removed all sensitive data:
EDIT: I noticed after posting this that when you follow the link to the file (hosted on Google Drive), it will allow you to select and copy most text on page, but not the stuff I'm missing. When you download the file, you are able to select the missing text in a PDF reader.