Iterate over text objects/boxes/labels on a PDF file

Question

Is there a way to iterate over text objects (like text labels or text boxes) in a PDF file and get those objects' properties (such as the x, y position on the page and the text itself)?

I need to get specific pieces of text that are close, in terms of page layout, to a particular label whose position on the page may vary. Sadly, when I simply extract the text, I usually don't get a string that I can work with to get the value I need.

There is no such thing as a "text label" or "text box" in a PDF, except maybe in AcroForm PDFs, or possibly in tagged PDFs, but not all PDFs are tagged. You could get the paths with the code here: https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions — Tilman Hausherr, Mar 13 '19 at 09:22
I used terms such as "text label" or "text boxes" because I am completely unfamiliar with the specifics of the PDF standard. I meant any method PDF uses to put text on the page. For instance, if I got it right, operators such as `Td` are used to position some text on the page. If PDFBox has any means to iterate over blocks of text that are positioned with, for instance, such an operator, I would be able to get what I want. — Ramiro, Mar 13 '19 at 16:39
You could use PDFStreamParser. See the RemoveAllText.java example. However that won't give you the position. So it might be better to extend PDFStreamEngine, or use the DrawPrintTextLocations example. — Tilman Hausherr, Mar 13 '19 at 20:11
@Ramiro Unfortunately, the PDF format allows multiple different ways to set the position for the next text drawn, and these can be used to set the position not only of the start of a text block but also of the start of some tiny bit of text within a block. Furthermore, different PDF generators use them in different ways. Thus, there is no way to recognize from the use of specific operators that a "block of text" starts. PDFBox, therefore, does not try to determine such text blocks at all during text extraction; it simply gives you the text bits, with or without coordinates. — mkl, Mar 14 '19 at 05:47
If you have to operate on PDFs that are all generated by the same PDF generator, please share an example. We can help inspect it and can probably give tips on how to extend PDFBox text extraction to recognize text blocks created by that generator. — mkl, Mar 14 '19 at 05:51
@mkl Thank you for your thoughtful replies. Sadly, I am dealing with PDFs generated from multiple sources (they even have small differences in format). Right now I am trying to perform the same task by converting the PDF into an image and running OCR on it. The OCR approach is working well, although it takes a lot of time. — Ramiro, Mar 18 '19 at 12:29
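
To illustrate the idea discussed in the comments (take the text bits with coordinates and search around a label), here is a minimal sketch in plain Java. It assumes the fragments have already been extracted together with their positions, for example by subclassing PDFBox's `PDFTextStripper` and collecting the text and `TextPosition` coordinates passed to `writeString`. The `Fragment` class, the `NearestTextFinder` class, and the sample coordinates are hypothetical and only serve the illustration; they are not part of PDFBox. Plain Euclidean distance between fragment start points is also just one possible notion of "close".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: finds the text fragments that lie closest to a label.
class NearestTextFinder {

    // A text fragment and the page coordinates of its start (hypothetical type).
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Euclidean distance between the start points of two fragments.
    static double distance(Fragment a, Fragment b) {
        return Math.hypot(a.x - b.x, a.y - b.y);
    }

    // Returns up to k fragments closest to the first fragment whose text
    // contains the given label, nearest first. Returns an empty list if
    // no fragment contains the label.
    static List<Fragment> nearestTo(List<Fragment> fragments, String label, int k) {
        Fragment anchor = null;
        for (Fragment f : fragments) {
            if (f.text.contains(label)) {
                anchor = f;
                break;
            }
        }
        if (anchor == null) {
            return new ArrayList<>();
        }
        final Fragment a = anchor;
        List<Fragment> others = new ArrayList<>();
        for (Fragment f : fragments) {
            if (f != a) {
                others.add(f);
            }
        }
        others.sort(Comparator.comparingDouble(f -> distance(a, f)));
        return new ArrayList<>(others.subList(0, Math.min(k, others.size())));
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for extracted text bits.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("Invoice No:", 50, 100));
        fragments.add(new Fragment("12345", 120, 100));
        fragments.add(new Fragment("Date:", 50, 300));

        for (Fragment f : nearestTo(fragments, "Invoice No", 1)) {
            System.out.println(f.text + " at (" + f.x + ", " + f.y + ")");
        }
    }
}
```

Because the search works only on coordinates, it does not depend on how a particular PDF generator positions its text, which is the obstacle described above. In practice you would probably want to restrict candidates to fragments to the right of or below the label and to merge per-glyph positions into words first.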

0 Answers