Disclaimer: I am the author of borb, the library used in this answer.
Ligatures are a font thing, and fonts are not one of the easiest things in the world of PDF. Extracting text typically means:
- Extract all text rendering instructions
- Organize those instructions in "logical reading order"
- "Play" those instructions, keeping in mind where you are
- Each instruction typically renders a glyph from a particular font
- Either the font contains information such as "the glyphs need to be mapped to characters in this predefined way"
- Or the font contains a to_unicode map, which tells you which character ID maps to which unicode character (and then you still need to map glyph IDs to character IDs)
(The above text is a simplification.)
That should give you some idea as to why your problem is so tricky.
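To make the last two bullets a bit more concrete, here is a toy sketch in plain Python. It does not use borb and does not parse a real PDF: the TOY_TO_UNICODE table, decode and expand_ligatures are hypothetical names made up for illustration. It assumes the font maps one character code to the single ligature character U+FB01, and it uses Unicode NFKC normalization (a swapped-in technique, not necessarily what any PDF library does) to turn that ligature back into plain letters.

```python
import unicodedata

# Hypothetical to_unicode-style table: character code -> unicode string.
# In a real PDF this lives in a CMap stream; these values are invented.
TOY_TO_UNICODE = {
    1: "o",
    2: "f",
    3: "\ufb01",  # one glyph: LATIN SMALL LIGATURE FI
    4: "c",
    5: "e",
}


def decode(codes, to_unicode):
    """Map each character code to its unicode string (U+FFFD if unknown)."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


def expand_ligatures(text):
    """Replace presentation-form ligatures with their letters using NFKC."""
    return unicodedata.normalize("NFKC", text)


raw = decode([1, 2, 3, 4, 5], TOY_TO_UNICODE)  # 5 characters, one is the ligature
clean = expand_ligatures(raw)                  # 6 characters, plain letters only
print(repr(raw), repr(clean))
```

Keep in mind that NFKC also rewrites other compatibility characters (superscripts, full-width forms, and so on), so an explicit ligature table may be the safer choice if you only want to touch ligatures.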
Using borb you can pretend this problem does not exist (in most cases).
This is how you'd perform text-extraction using borb:
    #!chapter_005/src/snippet_005.py
    import typing

    from borb.pdf import Document
    from borb.pdf import PDF
    from borb.toolkit import SimpleTextExtraction


    def main():

        # read the Document
        doc: typing.Optional[Document] = None
        l: SimpleTextExtraction = SimpleTextExtraction()
        with open("output.pdf", "rb") as in_file_handle:
            doc = PDF.loads(in_file_handle, [l])

        # check whether we have read a Document
        assert doc is not None

        # print the text on the first Page
        print(l.get_text()[0])


    if __name__ == "__main__":
        main()
You open a PDF in rb mode and attach an EventListener to the parser. The EventListener gets triggered every time a parsing instruction is performed. In this example we're using SimpleTextExtraction (which listens to page events and text-rendering events).
Afterwards, the listener can be queried for useful information. Depending on which listener you attached, e.g.:
- the text on each page
- the images in the PDF
- the fonts being used on each page
- the colors being used on each page
- etc
SimpleTextExtraction is of course only concerned with which text was rendered on the Page.
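If the listener idea is unfamiliar, the pattern can be sketched in a few lines of plain Python. This is not borb's implementation: TextCollector, toy_parse and the event names are hypothetical and only illustrate how a parser can push events to a listener that keeps the ones it cares about and is queried afterwards.

```python
from typing import Dict, Iterable, List, Tuple


class TextCollector:
    """Hypothetical listener: collects rendered text per page."""

    def __init__(self) -> None:
        self._chunks: Dict[int, List[str]] = {}
        self._page: int = -1

    def on_event(self, name: str, payload: str = "") -> None:
        if name == "begin_page":
            # a new page starts: open a fresh bucket for its text
            self._page += 1
            self._chunks[self._page] = []
        elif name == "render_text" and self._page >= 0:
            self._chunks[self._page].append(payload)
        # any other event (colors, images, ...) is simply ignored

    def get_text(self) -> Dict[int, str]:
        return {page: "".join(parts) for page, parts in self._chunks.items()}


def toy_parse(instructions: Iterable[Tuple[str, str]], listeners: list) -> None:
    """Stand-in for a parser: forwards every instruction to every listener."""
    for name, payload in instructions:
        for listener in listeners:
            listener.on_event(name, payload)


collector = TextCollector()
toy_parse(
    [
        ("begin_page", ""),
        ("render_text", "Hello "),
        ("set_color", "red"),
        ("render_text", "World"),
        ("begin_page", ""),
        ("render_text", "Page two"),
    ],
    [collector],
)
pages = collector.get_text()
print(pages[0])
```

A listener interested in colors or images would follow the same shape and just react to different events.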
There is a variant of SimpleTextExtraction that takes care of ligatures:
    #!chapter_005/src/snippet_005.py
    import typing

    from borb.pdf import Document
    from borb.pdf import PDF
    from borb.toolkit import SimpleTextExtraction
    from borb.toolkit import SimpleNonLigatureTextExtraction


    def main():

        # read the Document
        doc: typing.Optional[Document] = None
        l: SimpleTextExtraction = SimpleNonLigatureTextExtraction()
        with open("output.pdf", "rb") as in_file_handle:
            doc = PDF.loads(in_file_handle, [l])

        # check whether we have read a Document
        assert doc is not None

        # print the text on the first Page
        print(l.get_text()[0])


    if __name__ == "__main__":
        main()
You can download borb using PyPI, or directly from source.
Be sure to check out the examples repository to get a thorough understanding of what you can do with borb.