The PDF file is certainly binary; you should absolutely not try to use anything other than 'rb'
mode to read it.
What you can do is decode the text you extracted. If you knew the encoding were UTF-8 (which, based on the example you show, it probably is not), you could simply do
print(text.decode('utf-8'))
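To make the difference concrete, here is a small sketch; the byte string is invented for illustration and stands in for whatever your extractor returned:

```python
# A hypothetical byte string as it might come out of a PDF extractor;
# the name and the 0xE4 byte are made up for illustration.
raw = b'Virtasen m\xe4ki'

# If the bytes really were UTF-8, this would simply work ...
try:
    print(raw.decode('utf-8'))
except UnicodeDecodeError as exc:
    # ... but a lone 0xE4 is not valid UTF-8, so we end up here.
    print("not valid UTF-8:", exc)

# Latin-1 maps every byte to some character, so it never raises --
# though it only produces the *right* characters if the data
# actually is Latin-1.
print(raw.decode('latin-1'))
```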
Based on your single sample, I think it's safe to say that the encoding is something other than UTF-8; but because we don't know which encoding you are using when you look at the text, this is all speculation. If you can show the actual bytes in the string, it should not be hard to figure out the actual encoding from a few samples, perhaps with the help of a character chart like https://tripleee.github.io/8bit/. The character you pasted is U+2212 MINUS SIGN, which doesn't directly correspond to ä in any common 8-bit encoding, but maybe that's just a mistake in the paste.
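As a complement to the chart, you can also probe a mystery byte against a handful of common 8-bit encodings directly; here 0xE4 is chosen because it is the Latin-1/Windows-1252 byte for ä, but your actual bytes may differ:

```python
# Probe a single mystery byte against a few common 8-bit encodings.
# Whichever decoding reproduces the character you expect is a strong
# hint about which encoding your data actually uses.
mystery = b'\xe4'
for enc in ['latin-1', 'cp1252', 'cp437', 'koi8-r']:
    print(enc, '->', mystery.decode(enc))
```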
Maybe see also Problematic questions about decoding errors for some background. Ideally, update your question to provide the details it requests, if this didn't already get you to a place where you can solve your problem yourself.
If PyPDF genuinely thinks that character is "−",
then either its extraction logic is wrong, or the PDF itself is flawed. If you can't fix that, the simplest workaround is probably to manually remap the problematic characters as you find them. You might want to add a debug print with logging
to highlight any character outside the printable ASCII range in the extracted text, until you know you have covered them all.
import re
import logging
# ...
text = text.replace("\u2212", "ä").replace("\u1234", "ö")  # etc.
for match in re.findall(r'(.{1,5})?([^äö\n -\x7f])(.{1,5})?', text):
    logging.warning("%s found in %s", match[1], "".join(match))
One caveat: an earlier version of this snippet spelled the upper bound of the character class as "\u007f", and then U+2212 seemed to be matched as part of the ASCII range; "\u" escapes inside regular expressions are only understood by re from Python 3.3 onwards, so the plain "\x7f" escape above is the safer spelling. (Notice also the placeholder "\u1234"
-- replace that with something useful, and add more as you find them.)
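As a self-contained illustration of the remap-and-scan approach, here is a run on an invented sample string containing one character we already map (U+2212) and one we haven't covered yet (U+00E9); note the ASCII upper bound is spelled "\x7f", which re understands on every Python version:

```python
import re
import logging

# Made-up sample: one already-known bad character (U+2212) and one
# not yet mapped (U+00E9).
text = "Virtasen m\u2212ki ja caf\u00e9"
text = text.replace("\u2212", "ä")

# Flag any remaining character outside ä, ö, newline, and printable
# ASCII, with up to five characters of context on either side.
for before, char, after in re.findall(
        r'(.{1,5})?([^äö\n -\x7f])(.{1,5})?', text):
    logging.warning("%r found in %r", char, before + char + after)
```

Running this logs a warning for the unmapped 'é' with its surrounding context, which tells you which replacement to add next.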