Is it possible to get the bounding boxes for each word with Python?

Question

I know that

pdftotext -bbox foobar.pdf

creates an HTML file which contains content like

<word xMin="301.703800" yMin="104.483700" xMax="309.697000" yMax="115.283700">is</word>
<word xMin="313.046200" yMin="104.483700" xMax="318.374200" yMax="115.283700">a</word>
<word xMin="321.603400" yMin="104.483700" xMax="365.509000" yMax="115.283700">universal</word>
<word xMin="368.858200" yMin="104.483700" xMax="384.821800" yMax="115.283700">file</word>
<word xMin="388.291000" yMin="104.483700" xMax="420.229000" yMax="115.283700">format</word>

Hence each single word has a bounding box.
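
For reference, that output can at least be post-processed from Python. This is only a minimal sketch, assuming every `<word>` element looks exactly like the lines above (same attributes, same order); `WordBox`, `parse_word_boxes` and `SAMPLE` are names made up for this illustration, not part of any library:

```python
import html
import re
from typing import List, NamedTuple


class WordBox(NamedTuple):
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Matches one <word ...>text</word> element in the format shown above.
# ASSUMPTION: the four attributes always appear in this order.
WORD_RE = re.compile(
    r'<word xMin="(-?[\d.]+)" yMin="(-?[\d.]+)" '
    r'xMax="(-?[\d.]+)" yMax="(-?[\d.]+)">(.*?)</word>'
)


def parse_word_boxes(bbox_html: str) -> List[WordBox]:
    """Collect every <word> element of the -bbox output as a WordBox."""
    boxes = []
    for x_min, y_min, x_max, y_max, text in WORD_RE.findall(bbox_html):
        boxes.append(
            WordBox(
                html.unescape(text),  # turn entities such as &amp; back into text
                float(x_min),
                float(y_min),
                float(x_max),
                float(y_max),
            )
        )
    return boxes


# Two of the lines from the output shown above.
SAMPLE = """
<word xMin="301.703800" yMin="104.483700" xMax="309.697000" yMax="115.283700">is</word>
<word xMin="313.046200" yMin="104.483700" xMax="318.374200" yMax="115.283700">a</word>
"""

for box in parse_word_boxes(SAMPLE):
    print(box)
```

This still shells out to a command-line tool first (for example via `subprocess`), so it is a workaround rather than a Python-native solution.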

In contrast, the Python package PDFminer only seems able to give the position of a block of text (see example).

How can I get the bounding boxes for each word in Python?

PyPDF2 can do this now with visitor functions: https://pypdf2.readthedocs.io/en/latest/user/extract-text.html – Martin Thoma, Nov 29 '22 at 16:58

score 1 · Answer 1 · answered Nov 27 '22 at 22:01

disclaimer: I am the author of borb, the package used in this answer.

You will need to do some processing to get bounding boxes at word level. The problem is that a PDF (in the worst case) contains only rendering instructions and no structural information.

Put simply, your PDF might contain (in pseudo-code):

move to 90, 700
set the active font to Helvetica, size 12
set the active color to black
render "Hello World" in the active font

The problem is that the render instruction (the last one above) might contain anything from

a single letter
multiple letters
a single word
to multiple words

In order to retrieve the bounding boxes of words, you'll need to do some processing (as mentioned before). You will need to render those instructions and split the text (preferably as it is being rendered) into words.

Then it's a matter of keeping track of the coordinates of the turtle (the current drawing position), and you're set to go.
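
To make the idea concrete, here is a rough sketch of that bookkeeping for a single render instruction. It is not how borb implements it. It assumes every glyph is the same width (`char_width_factor * font_size`) and that the box height equals the font size; a real implementation would read glyph widths from the font. `WordBox` and `split_run_into_words` are hypothetical names:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def split_run_into_words(
    text: str,
    x: float,
    y: float,
    font_size: float,
    char_width_factor: float = 0.5,  # ASSUMPTION: fixed-width glyphs
) -> List[WordBox]:
    """Split one rendered text run into words while tracking the cursor."""
    advance = char_width_factor * font_size
    boxes: List[WordBox] = []
    cursor = x          # the "turtle": current horizontal drawing position
    word = ""           # characters of the word collected so far
    word_start = x      # x coordinate where the current word began

    for ch in text:
        if ch.isspace():
            # whitespace closes the current word, if there is one
            if word:
                boxes.append(WordBox(word, word_start, y, cursor, y + font_size))
                word = ""
        else:
            if not word:
                word_start = cursor
            word += ch
        cursor += advance  # every character, spaces included, moves the cursor

    # the run may end in the middle of a word
    if word:
        boxes.append(WordBox(word, word_start, y, cursor, y + font_size))
    return boxes


# the pseudo-code from above: move to 90, 700; size 12; render "Hello World"
for box in split_run_into_words("Hello World", x=90, y=700, font_size=12):
    print(box)
```

Note that this sketch treats each render instruction in isolation. If a word is spread over several instructions (the single-letter case above), the partial word would have to be carried over from one call to the next.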

borb does this (under the hood) for you.

from borb.pdf import PDF
from borb.toolkit import RegularExpressionTextExtraction

# this line builds a RegularExpressionTextExtraction
# this class listens to rendering instructions 
# and performs the logic I mentioned in the text part of this answer
l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[^ ]+")

# now we can load the file and perform our processing
with open("input.pdf", "rb") as fh:
    PDF.loads(fh, [l])

# now we just need to get the boxes out of it
# RegularExpressionTextExtraction returns a list of type PDFMatch
# this class can return a list of bounding boxes (should your
# regular expression ever need to be matched over separate lines of text)
for m in l.get_matches_for_page(0):
    # here we just print the Rectangle
    # but feel free to do something useful with it
    print(m.get_bounding_boxes()[0])

borb is an open source, pure Python PDF library that creates, modifies and reads PDF documents. You can download it using:

pip install borb

Alternatively, you can build from source by forking/downloading the GitHub repository.
