Issue with ligatures when converting PDF to text

Question

I am running into an issue when trying to convert a PDF to text where the ligatures 'fi' 'ff' 'fl' are being converted to an empty space. I have read through quite a few similar threads on the issue but have not found a solution that works.

This converted text will then be used to match text within a database. So accuracy is paramount.

Link to PDF

import pdfplumber

fp = 'Inspection_redacted.pdf'

pdf = pdfplumber.open(fp)
fp = fp[:-3] + 'txt'
text_file = open(fp, "w")

for page in pdf.pages:
    text = page.extract_text()
    text_file.write(text)

pdf.close()
text_file.close()

Can you share what you have tried so far and/or give a mini screenshot of your `.pdf`? — Timeless, Sep 14 '22 at 19:55
@abokey it wouldn't let me post an image initially, but here is a mini screenshot of the PDF in question: [link](https://imgur.com/vsH67jj) — Garrett, Sep 14 '22 at 21:21
I think you need to share your code so we can reproduce the issue. — Timeless, Sep 14 '22 at 21:48
@abokey I have attached a link to the PDF as well as my code. Thank you. — Garrett, Sep 15 '22 at 22:36
The problem is not pdfplumber; it is the PDF file, which does not fully support text extraction. The ToUnicode CMaps attached to the font objects incorrectly map the ligature glyph IDs to <0000> — iPDFdev, Sep 16 '22 at 11:06
@KJ the actual PDF files in question are at least 70 pages long. However, the only part that needs converting is 2-3 pages. The one I shared was edited to remove personal information. I had to edit this file in NitroPro and re-save it as a PDF; I am not sure if that would change the writer source? — Garrett, Sep 16 '22 at 15:30
@KJ All I am looking for is the converted plain text. This plain text is then matched against a database. It is looking for exact matches which is why accuracy is so important. The issue with the manual process is we are talking about thousands of PDFs — Garrett, Sep 16 '22 at 16:16

Timeless · Answer 1 · 2022-09-16T00:02:10.293

pdfplumber does not seem to handle these ligatures: 'fi', 'ff' and 'fl' come out as '\x00' (a NUL character, which shows up as an empty space). One workaround is to first convert the .pdf to images with the pdf2image library, then use an OCR tool (e.g., Python-tesseract) to recognize the text in those images.

Requirements :

To achieve that, you need to :

Install poppler from here
Install pdf2image: pip install pdf2image
Install tesseract from here
Install pytesseract: pip install pytesseract

Make sure to unzip poppler-0.68.0_x86.7z in C:\Program Files.

Code :

After installing all the requirements needed, you can run the code below :

from pdf2image import convert_from_path
from pytesseract import pytesseract
import os

pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

fp = 'Inspection_redacted.pdf'

text_file = open(fp[:-3] + 'txt', "w")

images = convert_from_path(fp, 500, poppler_path=r'C:\Program Files\poppler-0.68.0\bin')

for i, image in enumerate(images):
    fname = 'image'+str(i)+'.png'
    image.save(fname, "PNG")
    text = pytesseract.image_to_string(fname)
    text_file.write(text)
    os.remove(fname)

text_file.close()

`>>> Output`

@ 7.2.1 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: Chimney crown/cap cracked

7.2.2 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: CHIMNEY SWEEP - Excessive
Creosote

O 7.2.3 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: Cracks - in Firebox

O 7.2.4 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: Gaps - Seal

© 7.2.5 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: Chimney-Mortar Joint Gaps

See/Read here the text file in its entirety.

So I would really like to avoid OCR at all costs for some of the reasons K J mentioned. These PDFs have thousands of different text options and it would be impossible to know if OCR would convert them all correctly. If we can't find another way around it then I will mark this as the solution. — Garrett, Sep 16 '22 at 15:41

K J · Answer 2 · 2022-10-07T14:50:19.313

TL;DR: I was so busy looking at the PDF structure that I forgot to test the simplest text extraction first. As the end of this answer shows, this is easiest with pdftotext.

I agree OCR can help to locate suspect ligatures. However, on its own the OCR output is likely to contain as many text errors as the 14 ligatures you are trying to fix. So either compare both outputs line by line for differences (FC.exe or similar helps), or use the positions of fl and fi in the OCR output to fix the source or the extracted text.
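The line-by-line comparison can also be done in Python rather than with FC.exe. This is only a minimal sketch using the standard library's `difflib`; the function name `differing_lines` and the sample strings are made up for illustration, and it assumes both outputs are already loaded as strings.

```python
import difflib


def differing_lines(extracted_text, ocr_text):
    """Return (extracted_lines, ocr_lines) pairs for each region where the texts disagree."""
    # Ignore blank lines and surrounding whitespace so layout differences do not dominate.
    extracted = [line.strip() for line in extracted_text.splitlines() if line.strip()]
    ocr = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    matcher = difflib.SequenceMatcher(None, extracted, ocr, autojunk=False)
    return [
        (extracted[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Illustrative input: the first line lost its "ff" ligature during extraction.
extracted = "Missing shut o\x00 handle\nChimney crown/cap cracked\n"
ocr = "Missing shut off handle\nChimney crown/cap cracked\n"
for mine, theirs in differing_lines(extracted, ocr):
    print(repr(mine), "<->", repr(theirs))
```

Every reported pair still needs a human decision, because a difference can be an OCR mistake just as easily as a lost ligature.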

If you accept the plain text at face value, it is easy to find and replace the 14 known culprits, in fewer groupings, using a dictionary. For example, "re" would most likely be "fire", not "flre" or "ffre", but it could also be a genuine "re" on its own, so flag that line's context for double checking. If you use an editor you may see where a correction is needed; doing so, I now see that I missed an "off" in my first pass.

Other find-and-replace cases should be simpler: "ooring" is very probably "flooring", and "under oor" is unlikely to be anything other than "underfloor".

un�nished is most likely unfinished (here it's easier to see any remaining culprits)
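The dictionary-based find and replace described above could be sketched as follows. This is a rough sketch under stated assumptions: the lost ligature is assumed to survive in the text as a single placeholder character (`\x00` here; adjust it to whatever your extraction emits), and `VOCAB`, `repair_word` and `repair_line` are hypothetical names. In practice the vocabulary would be built from the database being matched.

```python
import re
from itertools import product

PLACEHOLDER = "\x00"  # assumption: what the extractor emits for a lost ligature
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")
WORD_RE = re.compile(r"[A-Za-z\x00]+")

# Hypothetical vocabulary for illustration only.
VOCAB = {"fire", "firebox", "firelogs", "flooring", "off", "filters", "unfinished"}


def repair_word(word, vocab):
    """Return (word, is_certain); repair only when exactly one candidate is in vocab."""
    count = word.count(PLACEHOLDER)
    if count == 0:
        return word, True
    parts = word.split(PLACEHOLDER)
    matches = set()
    # Try every ligature in every placeholder position.
    for combo in product(LIGATURES, repeat=count):
        candidate = parts[0] + "".join(lig + rest for lig, rest in zip(combo, parts[1:]))
        if candidate.lower() in vocab:
            matches.add(candidate)
    if len(matches) == 1:
        return matches.pop(), True
    return word, False  # no match or ambiguous: leave untouched and flag it


def repair_line(line, vocab):
    """Repair each word in a line; also return the words that need manual review."""
    uncertain = []

    def fix(match):
        fixed, certain = repair_word(match.group(0), vocab)
        if not certain:
            uncertain.append(match.group(0))
        return fixed

    return WORD_RE.sub(fix, line), uncertain


fixed, uncertain = repair_line("Asbestos - drywall/\x00ooring", VOCAB)
print(fixed)
print(uncertain)
```

Words with no match, or more than one, are left untouched and returned for double checking, which mirrors the advice to flag doubtful lines instead of guessing. A ligature that was dropped without leaving any placeholder (the bare "re" case) cannot be detected this way.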

pdftotext from Xpdf is well respected and did the job well, but most users will have the more permissive poppler utils. To use the commands below in practice, remove the find filter and redirect the outputs in a loop.

pdftotext -enc UTF-8 -nopgbrk -layout "path\file.pdf" will output "path\file.txt"

Here I test for all 14 ligatures found previously:

poppler-22.04 >library\bin\pdftotext -enc UTF-8 -layout ligatured.pdf -|find  /n "ff"
[23]  3.2.2 PLUMBING SYSTEM - FAUCETS, VALVES AND CONNECTED FIXTURES: Missing shut off handle

poppler-22.04 >library\bin\pdftotext -enc UTF-8 -layout ligatured.pdf -|find  /n "fi"
[56]insulation, air filters, registers): *Asbestos Ducts
[59]  7.2.1 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: Chimney crown/cap cracked
[61]7.2.2 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: CHIMNEY SWEEP - Excessive
[63]  7.2.3 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: Cracks - in Firebox
[64]  7.2.4 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: Gaps - Seal
[65]  7.2.5 FIREPLACES - FIREPLACES (including Gas/LP firelogs) AND CHIMNEYS: Chimney-Mortar Joint Gaps
[81]  11.2.1 ROOF - ROOF COVERINGS (Surface of roofing materials): Limited Life remaining
[82]  11.2.2 ROOF - ROOF COVERINGS (Surface of roofing materials): Shingle over Wood Shake
[88]13.2.1 INSULATION AND VENTILATION - INSULATION AND VAPOR RETARDERS (in unfinished spaces):
[91]13.2.2 INSULATION AND VENTILATION - INSULATION AND VAPOR RETARDERS (in unfinished spaces):
[94]13.2.3 INSULATION AND VENTILATION - INSULATION AND VAPOR RETARDERS (in unfinished spaces):
[97]13.2.4 INSULATION AND VENTILATION - INSULATION AND VAPOR RETARDERS (in unfinished spaces):

poppler-22.04 >library\bin\pdftotext -enc UTF-8 -layout ligatured.pdf -|find  /n "fl"
[70]  9.2.2 INTERIORS - INTERIORS - General and Visual Mold Assessment : Asbestos - drywall/flooring
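Since the question involves thousands of PDFs, the loop could be driven from Python. The following is a sketch only: it assumes `pdftotext` is on the PATH, the function names and folder layout are hypothetical, and the optional `-f`/`-l` page-range options are included because only a few pages per file are needed.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, first_page=None, last_page=None):
    """Build a pdftotext call with the same options as the command above."""
    command = ["pdftotext", "-enc", "UTF-8", "-nopgbrk", "-layout"]
    if first_page is not None:
        command += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        command += ["-l", str(last_page)]  # last page to convert
    return command + [str(pdf_path), str(txt_path)]


def convert_folder(folder, first_page=None, last_page=None):
    """Convert every .pdf in `folder` to a .txt file next to it; return the .txt paths."""
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(
            pdftotext_command(pdf_path, txt_path, first_page, last_page),
            check=True,
        )
        written.append(txt_path)
    return written
```

A call such as `convert_folder("reports", first_page=10, last_page=12)` (folder name and page numbers are made up) would write one text file per PDF. If the page range differs per document, it would have to be passed per file instead.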

I have the same hesitations about using OCR as you have mentioned. [This](https://tinywow.com/pdf/to-text) site is able to convert the file to text without OCR so I am thinking there must be a way — Garrett, Sep 16 '22 at 15:43

score 0 · Answer 3 · answered Mar 07 '23 at 09:41

disclaimer: I am the author of borb, the library used in this answer.

Ligatures are a font thing. And fonts are not one of the easiest things in the world of PDF. Extracting text typically means:

Extract all text rendering instructions
Organize those instructions in "logical reading order"
"Play" those instructions, keeping in mind where you are
Each instruction typically renders a glyph from a particular font
Either the font contains information such as "the glyphs need to be mapped to characters in this predefined way"
Or the font contains a to_unicode map, which tells you which character ID maps to which unicode character (and then you still need to map glyph IDs to character IDs)

(The above text is a simplification.)

That should give you some idea as to why your problem is so tricky.
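To make the to_unicode step concrete, here is a toy model in plain Python. It is not how borb or any other library implements it: the character codes, the dictionaries and the function names are invented for illustration. It shows how one wrong map entry for a ligature glyph is enough to lose "fi", which is what a comment under the question reports for this file.

```python
def decode(char_codes, to_unicode):
    """Translate character codes to text; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in char_codes)


def suspicious_codes(to_unicode):
    """List the codes whose mapping is the NUL character instead of real text."""
    return sorted(code for code, text in to_unicode.items() if text == "\x00")


# Hypothetical codes for the glyphs "fi" (one ligature glyph), "r" and "e".
CODES_FOR_FIRE = [1, 2, 3]

CORRECT_MAP = {1: "fi", 2: "r", 3: "e"}
BROKEN_MAP = {1: "\x00", 2: "r", 3: "e"}  # the ligature glyph maps to <0000>

print(repr(decode(CODES_FOR_FIRE, CORRECT_MAP)))
print(repr(decode(CODES_FOR_FIRE, BROKEN_MAP)))
print(suspicious_codes(BROKEN_MAP))
```

With the broken map the page still renders correctly, because rendering uses the glyph itself. Only the extracted text is affected.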

Using borb you can pretend this problem does not exist (in most cases).

This is how you'd perform text-extraction using borb:

#!chapter_005/src/snippet_005.py
import typing
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleTextExtraction


def main():

    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])

    # check whether we have read a Document
    assert doc is not None

    # print the text on the first Page
    print(l.get_text()[0])


if __name__ == "__main__":
    main()

You open a PDF in rb mode, you attach an EventListener to the parser. The EventListener will get triggered every time a parsing instruction is performed. In this example we're using SimpleTextExtraction (which listens to page events and text-rendering events).

Afterwards, the renderer can be queried for useful information. E.g.:

the text on each page
the images in the PDF
the fonts being used on each page
the colors being used on each page
etc

SimpleTextExtraction is of course only concerned about which text was rendered on the Page.

There is a variant of SimpleTextExtraction that takes care of ligatures:

#!chapter_005/src/snippet_005.py
import typing
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleTextExtraction
from borb.toolkit import SimpleNonLigatureTextExtraction



def main():

    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleNonLigatureTextExtraction()
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])

    # check whether we have read a Document
    assert doc is not None

    # print the text on the first Page
    print(l.get_text()[0])


if __name__ == "__main__":
    main()

You can download borb using PyPi, or directly from source. Be sure to check out the examples repository to get a thorough understanding of what you can do with borb.
