Python: open PDF in binary mode with UTF-8

Question

I am trying to open a PDF file using PyPDF4.

import PyPDF4

text = ""

pdf_file = open(filename,mode='rb')
pdfReader = PyPDF4.PdfFileReader(pdf_file)
pageObj = pdfReader.getPage(0)
text = pageObj.extractText()

print(text)

which works fine, except that the content of the PDF is German and the special characters (umlauts) come out wrong (e.g. zun−chst instead of zunächst).

I can't change the encoding in binary mode, but if I don't use binary mode I get the error

File "/usr/local/lib/python3.8/site-packages/PyPDF4/pdf.py", line 1754, in read
    stream.seek(-1, 2)
io.UnsupportedOperation: can't do nonzero end-relative seeks

There are multiple threads about this error (e.g. Seeking from end of file throwing unsupported exception), yet none of the solutions seem to work for me. Any help is much appreciated, thanks.

It's a bug in PyPDF2, PyPDF3 and PyPDF4; all three behave the same. Since only PyPDF3 seems to be active at this time, I created an issue at https://github.com/sfneal/PyPDF3/issues/13 – Wolfgang Fahl, Oct 20 '21 at 15:07

score 1 · Accepted Answer · answered Oct 20 '21 at 15:21

@downbydawn I had the same experience with the bug mentioned in the comment above.

I ended up using a modified version of https://stackoverflow.com/a/26351413/1497139 :

# derived from
# https://stackoverflow.com/a/26351413/1497139

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

class PDFMiner:
    '''
    PDFMiner wrapper to get PDF Text
    '''

    @classmethod
    def getPDFText(cls,pdfFilenamePath,throwError:bool=True):
        retstr = StringIO()
        parser = PDFParser(open(pdfFilenamePath,'rb'))
        try:
            document = PDFDocument(parser)
        except Exception as e:
            errMsg=f"error {pdfFilenamePath}:{str(e)}"
            print(errMsg)
            if throwError:
                raise e
            return ''
        if document.is_extractable:
            rsrcmgr = PDFResourceManager()
            device = TextConverter(rsrcmgr,retstr,  laparams = LAParams())
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for page in PDFPage.create_pages(document):
                interpreter.process_page(page)
            return retstr.getvalue()
        else:
            print(pdfFilenamePath,"Warning: could not extract text from pdf file.")
            return ''

tripleee · Answer 2 · 2020-10-23T08:46:58.133

-1

The PDF file is certainly binary; you should absolutely not try to use anything other than 'rb' mode to read it.
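To see why text mode fails here: the traceback in the question shows the library calling `stream.seek(-1, 2)`, that is, seeking backwards from the end of the file, which Python's text-mode file objects do not support. A minimal sketch, using a throwaway file and a hypothetical helper `try_end_seek` (not part of PyPDF4):

```python
import io
import os
import tempfile


def try_end_seek(path, mode):
    """Return 'ok' if seek(-1, SEEK_END) works in this mode, else the exception class name."""
    with open(path, mode) as handle:
        try:
            handle.seek(-1, os.SEEK_END)  # same kind of call as in the traceback
            return "ok"
        except io.UnsupportedOperation as exc:
            return type(exc).__name__


# Throwaway stand-in file; it only needs to exist and be non-empty.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n%%EOF\n")
    sample_path = tmp.name

binary_result = try_end_seek(sample_path, "rb")
text_result = try_end_seek(sample_path, "r")
os.remove(sample_path)

print("rb:", binary_result)
print("r: ", text_result)
```

The error is therefore about how the file was opened, not about the encoding of the text inside the PDF.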

What you can do is decode the text you extracted. If you know the encoding is UTF-8 (which is probably not true, based on the example you show),

print(text.decode('utf-8'))

Based on your single sample, I think it's safe to say that the encoding is something other than UTF-8, but because we don't know which encoding you are using when you look at the text, this is all speculation. If you can show the actual bytes in the string, it should not be hard to figure out the actual encoding from a few samples, maybe with the help of a character chart like https://tripleee.github.io/8bit/. The character you pasted is U+2212, which doesn't directly appear to correspond to any common 8-bit encoding of ä, but maybe that's just a mistake in the paste.
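As a rough way to do that comparison without a chart, the sketch below prints the bytes that ä and U+2212 produce in a handful of encodings. The codec list and the helper name `encodings_of` are assumptions chosen for illustration, not an exhaustive or authoritative set:

```python
# Candidate codecs are an assumption; add whichever ones seem plausible.
CANDIDATE_CODECS = ["utf-8", "latin-1", "cp1252", "cp437", "mac_roman"]


def encodings_of(char, codecs=CANDIDATE_CODECS):
    """Map codec name -> hex bytes for char, or None if the codec cannot encode it."""
    result = {}
    for codec in codecs:
        try:
            result[codec] = char.encode(codec).hex()
        except UnicodeEncodeError:
            result[codec] = None
    return result


expected = encodings_of("ä")       # what the PDF text should contain
observed = encodings_of("\u2212")  # what the extraction returned

for codec in CANDIDATE_CODECS:
    print(codec, "ä =", expected[codec], "| U+2212 =", observed[codec])
```

If the bytes of the observed character in one codec matched the bytes of ä in another, that pair would be a candidate for a mis-decoding; if nothing lines up, the substitution probably happens elsewhere.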

Maybe see also Problematic questions about decoding errors for some background. Ideally perhaps update your question to provide the details it requests if this didn't already get you to a place where you can solve your problem yourself.

If PyPDF genuinely thinks that character is "−" then perhaps its extraction logic is wrong, or perhaps the PDF is flawed. If you can't fix it, you can simply remap the problematic characters manually as you find them. You might want to add a debug print with logging to highlight any character outside the printable ASCII range in the extracted text until you know you have covered them all.

import re
import logging

# ...
text = text.replace("\u2212", "ä").replace("\u1234", "ö")  # etc
for match in re.findall(r'(.{1,5})?([^äö\n -\u007f])(.{1,5})?', text):
    logging.warning("{0} found in {1}".format(match[1], "".join(match)))

Unfortunately, the above doesn't exactly work -- U+2212 in particular seems to be matched as part of the ASCII range no matter what re flags I pass in. (Notice also the placeholder "\u1234" -- replace that with something useful, and add more as you find them.)
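One possible explanation, offered as an assumption rather than a confirmed diagnosis: the `replace` call runs first, so by the time the regex runs the text no longer contains U+2212, and the ä that replaced it is excluded by the character class. A minimal sketch that scans before remapping; the names `REMAP`, `find_suspects` and `remap` are hypothetical:

```python
import logging
import re

# Wrong character -> intended character; extend as more turn up.
REMAP = {"\u2212": "ä"}

# Anything that is not a newline, printable ASCII, or a known-good German letter.
SUSPECT = re.compile(r"[^\n\x20-\x7eäöüÄÖÜß]")


def find_suspects(text, context=5):
    """Return (character, surrounding snippet) pairs for each suspect character."""
    suspects = []
    for match in SUSPECT.finditer(text):
        start = max(match.start() - context, 0)
        suspects.append((match.group(), text[start:match.end() + context]))
    return suspects


def remap(text, table=REMAP):
    """Replace every key of the table with its value in one pass."""
    return text.translate(str.maketrans(table))


sample = "zun\u2212chst einmal"
for char, snippet in find_suspects(sample):  # scan first ...
    logging.warning("U+%04X found in %r", ord(char), snippet)
fixed = remap(sample)                        # ... then remap
```

Running `find_suspects` again on the remapped text shows which characters are still not covered by the table.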

edited Oct 23 '20 at 08:46

answered Oct 21 '20 at 09:43

tripleee


I initially forgot to include the line of code where the text is extracted. I can't decode the text string as it is already decoded. I haven't found a way to show the actual bytes yet, but will try to. – downbydawn Oct 21 '20 at 11:02
In brief, `repr("zun−chst".encode('utf-8'))` displays `"b'zun\\xe2\\x88\\x92chst'"` where the `b'...'` is Python's indicator that this is a byte string, and the `\x` escapes are used for any bytes which are not printable ASCII characters. This also conveniently shows you what the actual UTF-8 encoding of this character looks like. – tripleee Oct 21 '20 at 11:07
Okay, how can I view the actual bytes of the strings in the pdf? – downbydawn Oct 21 '20 at 11:31
`repr(text)` would be a good start but perhaps not yet sufficient. It is unfortunate that your code sample reuses the variable `text` but where you `print(text)` you should be able to `print(repr(text))`. – tripleee Oct 21 '20 at 11:43
`print(reprise(text))` gives out zun-chst – downbydawn Oct 21 '20 at 12:34
I guess you have `repr` not `reprise` though? – tripleee Oct 21 '20 at 12:36
Try `print(list("%02x" % ord(x) for x in text))` then? – tripleee Oct 21 '20 at 12:37
Yes, my bad. that will output ['7a', '75', '6e', '2212', '63', '68', '73', '74', '20'] – downbydawn Oct 21 '20 at 12:51
So yes you have U+2212 there but again, there is no encoding that I know of in which that represents ä, and in fact, it quite unambiguously can only represent U+2212 itself. Perhaps your PDF library is doing some shenanigans related to guessing the encoding, and guessing wrong. – tripleee Oct 21 '20 at 12:52
I haven't found the reason why it does shenanigans, but it works like a charm with pdfminer instead of PyPDF4. Thanks for the help! – downbydawn Oct 23 '20 at 08:42
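
The code-point dump from the comments can be made a little more readable by adding the Unicode name of each character. A small sketch using only the standard library; the helper name `describe` is an assumption for illustration:

```python
import unicodedata


def describe(text):
    """Return (hex code point, Unicode name) for every character in text."""
    return [
        ("%04x" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for code_point, name in describe("zun\u2212chst"):
    print(code_point, name)
```

The name makes it easier to spot that the fourth character is a mathematical minus sign rather than a letter or an ordinary hyphen.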
