I want to extract the text from a PDF and use some regular expressions to filter for information.
I am coding in Python 3.7.4 and use fitz (PyMuPDF) for parsing the PDF. The PDF is written in German. My code looks as follows:
import fitz  # PyMuPDF

doc = fitz.open(pdfpath)
pagecount = doc.pageCount
page = 0
content = ""
while page < pagecount:
    # Collect the plain text of every page into one string.
    p = doc.loadPage(page)
    page += 1
    content = content + p.getText()
When printing the content, I noticed that the first (and important) half of the document comes out as a strange mix of what look like Japanese characters and other symbols, like this: ョ。オウキ・ゥエオョァ@ュ
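For reference, here is a minimal way to inspect the code points of the garbled sample (the snippet string below is just the pasted output); most of them fall into the Katakana block U+30A0–U+30FF:

# Print each character of the garbled sample with its code point.
snippet = "ョ。オウキ・ゥエオョァ@ュ"
for ch in snippet:
    print(ch, hex(ord(ch)))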
I tried to fix it with different codecs (latin-1, iso-8859-1); the extracted string itself is definitely valid UTF-8. For example, this round trip changes nothing:

content = content + p.getText().encode("utf-8").decode("utf-8")
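It might also help to compare the fonts on the garbled pages with those on the pages that extract correctly; a sketch using PyMuPDF's per-page font listing, where the exact method name is an assumption that depends on the installed release (newer versions call it get_page_fonts):

# List the fonts per page; each entry is roughly
# (xref, ext, type, basefont, name, encoding). An unusual or custom
# encoding on the garbled pages would point at the font's CMap
# rather than at a decoding step in Python.
for pno in range(doc.pageCount):
    print(pno, doc.getPageFontList(pno))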
I have also tried to get the text using minecart:
import minecart

file = open(pdfpath, 'rb')
document = minecart.Document(file)
for page in document.iter_pages():
    for lettering in page.letterings:
        print(lettering)
which results in the same problem.
With textract, the first half comes back as an empty string:
import textract
text = textract.process(pdfpath)
print(text.decode('utf-8'))
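textract can hand PDF extraction to a different backend; whether that helps here is an open question, but switching from the default pdftotext to pdfminer is a cheap test (the method parameter is taken from the textract docs):

import textract

# Use pdfminer instead of the default pdftotext backend.
text = textract.process(pdfpath, method='pdfminer')
print(text.decode('utf-8'))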
Same thing with PyPDF2:
import PyPDF2

pdfFileObj = open(pdfpath, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
for index in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(index)
    print(pageObj.extractText())
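As a cross-check that the visible text is recoverable at all, rendering the pages to images and running OCR bypasses the PDF's text layer entirely; a minimal sketch, assuming pdf2image and pytesseract (plus a Tesseract install with the German language pack) are available:

from pdf2image import convert_from_path
import pytesseract

# Render each page to an image and OCR it; broken font encodings
# in the text layer no longer matter on this path.
images = convert_from_path(pdfpath, dpi=300)
ocr_text = ""
for image in images:
    ocr_text += pytesseract.image_to_string(image, lang='deu')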
I don't understand the problem, as it looks like a normal PDF with normal text. Also, some of the PDFs don't have this problem.