PDFMiner's documentation says:

PDFMiner allows one to obtain the exact location of text in a page

However, I have not been able to find out how to do this; PDFMiner's documentation is rather sparse.

technillogue

1 Answer

You are looking for the bbox property on every layout object. There is a little bit of information on how to parse the layout hierarchy in the PDFMiner documentation, but it doesn't cover everything.

Here's an example:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure


def parse_layout(layout):
    """Function to recursively parse the layout tree."""
    for lt_obj in layout:
        print(lt_obj.__class__.__name__)
        print(lt_obj.bbox)
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            print(lt_obj.get_text())
        elif isinstance(lt_obj, LTFigure):
            parse_layout(lt_obj)  # Recursive


# Keep the file open while pages are processed; the parser reads from it lazily.
with open('example.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()  # default layout analysis parameters
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()  # layout tree for the page just processed
        parse_layout(layout)
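
Each bbox printed above is a tuple (x0, y0, x1, y1), where (x0, y0) is the lower-left corner and (x1, y1) the upper-right corner, measured in PDF points from the bottom-left of the page. If you need coordinates measured from the top-left instead (as most image libraries expect), a small conversion is enough. This is only a sketch: `to_top_left` is a hypothetical helper, not part of PDFMiner, and it assumes you pass in the page height yourself (for example, the last element of the page layout's own bbox).

```python
def to_top_left(bbox, page_height):
    """Convert a bottom-left-origin (x0, y0, x1, y1) box to
    (left, top, right, bottom) measured from the top-left corner.

    Hypothetical helper for illustration; not a PDFMiner function.
    """
    x0, y0, x1, y1 = bbox
    # The top edge (y1) is the one closest to the top of the page,
    # so it becomes the smaller value after flipping the y axis.
    return (x0, page_height - y1, x1, page_height - y0)
```

Inside the page loop you could call it as `to_top_left(lt_obj.bbox, layout.bbox[3])`, assuming the page's bbox starts at the origin.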

If you are interested in the location of individual LTChar objects, you can recursively parse into the child layout objects of LTTextBox and LTTextLine just like what is done with LTFigure in the above example.
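
As a rough sketch of that recursion, the generator below walks any layout object down to its leaves and yields the text and bbox of each positioned character. It is duck-typed so that it does not depend on importing PDFMiner classes, and it rests on two assumptions you should verify against your PDFMiner version: container objects are iterable while LTChar is not, and LTAnno (the virtual spaces and newlines inserted by layout analysis) has `get_text()` but no `bbox`. The name `iter_chars` is made up for this example.

```python
def iter_chars(lt_obj):
    """Yield (text, bbox) for every leaf under lt_obj that has both.

    Assumption: leaves with get_text() and bbox are LTChar objects;
    LTAnno lacks bbox, and lines/rectangles/images lack get_text().
    """
    try:
        children = iter(lt_obj)
    except TypeError:
        # Not iterable, so this is a leaf object.
        bbox = getattr(lt_obj, "bbox", None)
        get_text = getattr(lt_obj, "get_text", None)
        if bbox is not None and get_text is not None:
            yield get_text(), bbox
        return
    for child in children:
        # Containers (pages, figures, text boxes, text lines): descend.
        yield from iter_chars(child)
```

With the page loop above, `for text, bbox in iter_chars(layout): print(text, bbox)` would print one line per character.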

Matt Swain
  • 1) Could you explain what LAParams() does, please? 2) Isn't it more pythonic to try to get text and then try to recurse rather than using isinstance? – technillogue Aug 12 '14 at 16:27
  • Aren't there types of containers other than LTFigure? – technillogue Aug 12 '14 at 16:28
  • LAParams contains the parameters used for the layout analysis that merges characters into words and lines based on their locations. You can pass initialization parameters like line_overlap, char_margin, line_margin, word_margin, boxes_flow, detect_vertical. See the PDFMiner docs for explanations and default values. – Matt Swain Aug 12 '14 at 16:38
  • Other than `LTFigure` there's also `LTTextBox` that contains `LTTextLine` which in turn contains `LTChar` and `LTAnno`. The [PDFMiner docs](https://euske.github.io/pdfminer/programming.html) have a diagram of the hierarchy. – Matt Swain Aug 12 '14 at 16:39
  • Things seem to work without passing LAParams, so why are they needed? Isn't it more Pythonic to use EAFP rather than isinstance? – technillogue Aug 12 '14 at 17:01
  • `LAParams` is really just a way to modify the parameters used by the layout analyser. It's good practice to pass it to `PDFPageAggregator` even if you just use the default parameters, because otherwise some of the layout analysis may not be performed. You probably can make my `parse_layout` function more pythonic. Every `LT*` object should be iterable even if it doesn't have any children, so the `LTFigure` isinstance check is probably unnecessary. Similarly, you could just attempt `get_text()` for all and catch the failure if it's not implemented on that `LT*` object. – Matt Swain Aug 13 '14 at 12:06
  • Is there any way to parse just the first LTTextBox of each page? (Actually, I want the box header.) – sunny Jan 24 '18 at 21:38
  • What's your basis for thinking that recursing into `LTFigure`s like this works? Over at https://stackoverflow.com/a/53360415/1709587, I claim it's broken because an `LTFigure` cannot contain an `LTTextBox`... but if I'm wrong, I'd appreciate you proving me so. – Mark Amery Nov 18 '18 at 11:44
  • Rather than using `LTTextBox`, is there another parameter that will just find coordinates for individual words? – Starbucks Dec 04 '19 at 20:17
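
For reference, the EAFP variant discussed in the comments above might look roughly like the sketch below. Instead of isinstance checks it simply tries `get_text()` and iteration on every object and catches the failure. Treat it as an illustration rather than a tested replacement for `parse_layout`: the name `describe_layout` is invented here, and the code deliberately does not assume that every `LT*` object is iterable or has a `bbox`, so it guards against both.

```python
def describe_layout(lt_obj, depth=0):
    """Return (depth, class name, bbox, text) rows for all descendants.

    EAFP sketch: no isinstance checks, just try and catch.
    """
    rows = []
    try:
        children = list(lt_obj)
    except TypeError:
        return rows  # leaf object: nothing to descend into
    for child in children:
        try:
            text = child.get_text()
        except AttributeError:
            text = None  # e.g. lines, rectangles, images
        bbox = getattr(child, "bbox", None)  # None for objects without one
        rows.append((depth, type(child).__name__, bbox, text))
        rows.extend(describe_layout(child, depth + 1))
    return rows
```

Called as `describe_layout(layout)` inside the page loop, it returns one row per object, with `depth` showing how deeply the object is nested.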