Disclaimer: I am the author of borb, the package used in this answer.
You will need to do some processing to get bounding boxes at the word level. The problem is that a PDF (in the worst case) contains only rendering instructions, and no structural information.
Put simply, your PDF might contain (in pseudo-code):
- move to 90, 700
- set the active font to Helvetica, size 12
- set the active color to black
- render "Hello World" in the active font
The problem is that the last instruction (the one that renders text) might contain anything from
- a single letter
- multiple letters
- a single word
- to multiple words
In order to retrieve the bounding boxes of words, you'll need to do some processing (as mentioned before). You will need to render those instructions and split the text (preferably as it is being rendered) into words.
Then it's a matter of keeping track of the coordinates of the turtle (the current drawing position), and you're set to go.
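To make the idea concrete, here is a minimal sketch of that bookkeeping. It is not how borb implements it. The `(x, y, text)` tuples stand in for render instructions, and `WordBox`, `word_boxes`, and `char_width_factor` are hypothetical names made up for this illustration. The big simplifying assumption is that every glyph has the same width (a fraction of the font size); a real implementation would look up the width of each glyph in the font.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class WordBox:
    text: str
    x: float       # left edge of the word
    y: float       # baseline of the word
    width: float
    height: float


def word_boxes(
    chunks: Iterable[Tuple[float, float, str]],
    font_size: float = 12.0,
    char_width_factor: float = 0.5,  # ASSUMPTION: every glyph is this wide (times font size)
) -> List[WordBox]:
    """Split (x, y, text) render instructions into word-level boxes."""
    char_width = font_size * char_width_factor
    boxes: List[WordBox] = []
    current: Optional[WordBox] = None  # the word being built, if any

    for x, y, text in chunks:
        cursor = x  # the "turtle": where the next glyph will be drawn
        # A chunk that does not continue exactly where the previous
        # word ended closes that word.
        if current is not None and (
            y != current.y or abs(cursor - (current.x + current.width)) > 1e-6
        ):
            boxes.append(current)
            current = None
        for ch in text:
            if ch.isspace():
                # Whitespace closes the current word.
                if current is not None:
                    boxes.append(current)
                    current = None
            elif current is None:
                # First glyph of a new word: the box starts at the cursor.
                current = WordBox(ch, cursor, y, char_width, font_size)
            else:
                # Another glyph of the same word: the box grows to the right.
                current.text += ch
                current.width += char_width
            cursor += char_width  # the turtle advances for every glyph, spaces included

    if current is not None:
        boxes.append(current)
    return boxes


# The same text, delivered as one instruction or split mid-word,
# yields the same word boxes.
for box in word_boxes([(90.0, 700.0, "Hel"), (108.0, 700.0, "lo World")]):
    print(box)
```

Note that the word "Hello" is stitched back together even though it arrives in two separate instructions, which is exactly the case that makes naive per-instruction bounding boxes unreliable.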
borb does this (under the hood) for you.
from borb.pdf import PDF
from borb.toolkit import RegularExpressionTextExtraction
# this line builds a RegularExpressionTextExtraction
# this class listens to rendering instructions
# and performs the logic I mentioned in the text part of this answer
l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[^ ]+")
# now we can load the file and perform our processing
with open("input.pdf", "rb") as fh:
    PDF.loads(fh, [l])
# now we just need to get the boxes out of it
# RegularExpressionTextExtraction returns a list of type PDFMatch
# this class can return a list of bounding boxes (should your
# regular expression ever need to be matched over separate lines of text)
for m in l.get_matches_for_page(0):
    # here we just print the first Rectangle
    # but feel free to do something useful with it
    print(m.get_bounding_boxes()[0])
borb is an open source, pure Python PDF library that creates, modifies and reads PDF documents. You can install it using:
pip install borb
Alternatively, you can build from source by forking/downloading the GitHub repository.