
I am using pdfminer with Python 3 and I get weird letters in the text that is extracted from the PDF.

For instance, I get signiﬁcant instead of significant (notice that the letters f and i are merged into a single character).

I have no idea why this is happening. This is the code I am using.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
from nltk.tokenize import sent_tokenize


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    # Open the file in a with-block so it is closed once extraction is done
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()

    sentences = sent_tokenize(text)

    for s in sentences:
        print(s)
        print("\n\n")

My only guess so far is that it has to do with the encoding, but there seems to be no way to retrieve the encoding of a PDF.

Jongware
LBes
  • Have you verified using other means that the text in the PDF is actually stored as separate characters, and that you're indeed getting different results from pdfminer than via those other means? – Random Davis Oct 17 '18 at 21:10
  • Building on the previous comment, `ﬁ` is a common [ligature](https://en.wikipedia.org/wiki/Typographic_ligature). – ChrisGPT was on strike Oct 17 '18 at 21:11
  • @RandomDavis I have opened several PDF files with several viewers and the word is totally fine every time. Ctrl-F for "significant" does find the word. – LBes Oct 17 '18 at 21:12
  • Because (1) it is a very *very* common ligature which is known by many (apparently all of your) "several viewers", and/or (2) a PDF can contain meta-data that explicitly *state* that the correct translation of a glyph in a certain font should be another series of characters, and/or (3) any *string* in a PDF can have meta-data attached which contains the meaning "behind" the string, for the benefit of search engines, screen readers, and other software. – Jongware Oct 17 '18 at 21:17
  • @usr2564301 any way to avoid this issue then? – LBes Oct 17 '18 at 21:18
  • Avoid? Not unless you are willing to change pdfminer's code (which, in turn, needs far more knowledge of a PDF's internal structure than you currently have). Easier to add a single line to your code that replaces `ﬁ` (the single ligature character) with `fi` (two letters). (There is at least one more "common" ligature.) – Jongware Oct 17 '18 at 21:23
  • @usr2564301 indeed it does need more knowledge than I have. I will look at other possible ligatures in pdf to make sure I can spot most of them. – LBes Oct 17 '18 at 21:25

1 Answer


PDFminer is working correctly. The character in question is the Unicode character U+FB01, the fi ligature.
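If you want to confirm this on your own output, a minimal sketch along the following lines lists every non-ASCII character in a string together with its code point and Unicode name. The helper name `describe_non_ascii` is made up for this illustration; only the standard-library `unicodedata` module is used.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each distinct
    non-ASCII character in text, in order of first appearance.

    Hypothetical helper for illustration; not part of pdfminer.
    """
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in seen
    ]


if __name__ == "__main__":
    # '\ufb01' is the single ligature character, written as an escape
    # so that it survives copy and paste.
    for entry in describe_non_ascii("signi\ufb01cant"):
        print(entry)
```

Running this over the text returned by pdfminer should show whether the odd character really is a ligature or something else entirely.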

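Add a line to your code that replaces ﬁ (the single ligature character) with fi (the two letters f and i):

for s in sentences:
    s = s.replace('\ufb01', 'fi')  # U+FB01 (fi ligature) -> 'f' + 'i'
    print(s)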

There is one other very common, purely typographic(*) ligature defined in Unicode: U+FB02, the ﬂ ligature. Treat it the same way:

    s = s.replace('\ufb02', 'fl')  # U+FB02 (fl ligature) -> 'f' + 'l'

There are a few more in the Alphabetic Presentation Forms block, which you might as well include too.

(*) Do not make the mistake of changing æ to ae and œ to oe. These are not 'purely typographic ligatures' but valid characters in their own right.
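To cover all of the Latin ligatures in that block in one pass, something like the sketch below could be used. It maps U+FB00 through U+FB06 to their plain-letter spellings with `str.translate`. The names `LIGATURES` and `expand_ligatures` are invented for this example, and the table deliberately leaves æ and œ alone.

```python
# Latin ligatures from the Alphabetic Presentation Forms block
# (U+FB00 to U+FB06), written as escapes so they survive copy and paste.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}

# Build the translation table once; str.translate accepts string values.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace typographic ligature characters with separate letters.

    Hypothetical helper for illustration; not part of pdfminer.
    """
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("signi\ufb01cant in\ufb02uence at the o\ufb03ce"))
```

Calling `expand_ligatures(text)` right after `retstr.getvalue()` would clean the whole document before it is tokenized. `unicodedata.normalize("NFKC", text)` also expands these ligatures, but it rewrites other compatibility characters as well (superscript digits, for example), so the explicit table is the narrower choice.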

Jongware
  • Accepted. Thought there could be a way around it, but I'll have to stick with this and look for other possible ligatures. – LBes Oct 17 '18 at 21:28