Disclaimer: I am the author of pText, the library used in this answer.
- Load the Document
- Define a LocationFilter
A LocationFilter does pretty much what it says on the tin. It listens to parsing events (like "render text" or "change font to"), but it only passes on those that fall within a given boundary.
Keep in mind the origin in PDF coordinates is at the lower left corner.
The LocationFilter in this example will therefore match only text in the lower left corner of the page.
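If your box was measured from the top left (as many viewers and image tools report it), it has to be flipped before it is useful as a PDF boundary. Here is a minimal sketch of that conversion. It is independent of pText: the helper name `top_left_box_to_pdf` is hypothetical, and the default page height of 842 points assumes an A4 page.

```python
def top_left_box_to_pdf(x, y_from_top, width, height, page_height=842.0):
    """Convert a box measured from the top-left corner into
    (x0, y0, x1, y1) corners measured from the lower-left corner.

    Hypothetical helper for illustration; page_height defaults to A4 (842 pt).
    """
    y1 = page_height - y_from_top  # upper edge in PDF coordinates
    y0 = y1 - height               # lower edge in PDF coordinates
    return (x, y0, x + width, y1)


# A 100 x 100 box in the top-left corner of an A4 page
print(top_left_box_to_pdf(0, 0, 100, 100))
```

The resulting corners can then be used to decide which boundary to give the filter.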
- Add a SimpleTextExtraction to the LocationFilter
The next question is "what is the LocationFilter going to pass events to?"
In this case, you can start by trying a SimpleTextExtraction.
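To make the idea of "a filter that passes events on to a listener" concrete, here is a toy sketch of the pattern. It is not pText's implementation: `BoxFilter`, `TextCollector`, and the `on_text` event method are hypothetical names made up for this illustration, and the events are simulated by hand.

```python
class BoxFilter:
    """Hypothetical stand-in for a location filter: forwards a text event
    only if its position falls inside the box (x0, y0)-(x1, y1)."""

    def __init__(self, x0, y0, x1, y1):
        self.box = (x0, y0, x1, y1)
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)
        return self

    def on_text(self, x, y, text):
        x0, y0, x1, y1 = self.box
        if x0 <= x <= x1 and y0 <= y <= y1:
            # Inside the boundary: pass the event to every listener
            for listener in self.listeners:
                listener.on_text(x, y, text)


class TextCollector:
    """Hypothetical stand-in for a text extraction listener."""

    def __init__(self):
        self.chunks = []

    def on_text(self, x, y, text):
        self.chunks.append(text)

    def get_text(self):
        return " ".join(self.chunks)


box_filter = BoxFilter(0, 0, 100, 100)
collector = TextCollector()
box_filter.add_listener(collector)

# Simulated "render text" events as (x, y, text); origin is the lower left
for event in [(10, 20, "footer"), (300, 700, "title"), (90, 95, "note")]:
    box_filter.on_text(*event)

print(collector.get_text())  # footer note
```

Only the two events inside the lower left box reach the collector; the "title" event is dropped by the filter.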
Putting it all together:
l0 = LocationFilter(0, 0, 100, 100)
l1 = SimpleTextExtraction()
l0.add_listener(l1)
doc = PDF.loads(pdf_file_handle, [l0])
After the Document has loaded, you can ask the SimpleTextExtraction for all the text on a given Page.
l1.get_text(0)
You can obtain pText either on GitHub or from PyPI.
There are a ton more examples; check them out to find out more about working with images.