Text read from a PDF with PyPDF4 comes out in an unknown encoding

Question

I'm using PyPDF4 to read text from a PDF I downloaded. This works, but the text string is not readable:

ÓŒŁ–Ł@`@äŽ–Ł@`@Ä›¥–Ž¢–@¥ŒŒŽ—–ﬁ–Ł
Áﬁ⁄–ﬂ–Ł–@›ŁƒŒŽﬂ†£›–

As far as I know, the file is not encrypted: I can open it in Acrobat Reader without any problem. In Acrobat Reader I can also select, copy, and paste the text correctly.

For reference, this is the code:

import glob
import PyPDF4


relevant_path = 'C:\\_Personal\\Mega\\PycharmProjects\\PDFHandler\\docs\\input\\'

if __name__ == '__main__':

    for PDFFile in glob.iglob(relevant_path + '*.pdf', recursive=True):

        print('Processing File: ' + PDFFile.split('\\')[-1])
        pdfReader = PyPDF4.PdfFileReader(PDFFile)
        num_pages = pdfReader.numPages

        print(num_pages)

        page_count = 0
        text = ''

        while page_count < num_pages:
            pageObj = pdfReader.getPage(page_count)
            page_count += 1
            text += pageObj.extractText()

        print(text)

Any hints? Are there other packages I could use?
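
One detail in the sample output above may be a clue: `@` appears exactly where spaces would be expected. In EBCDIC the space character is byte 0x40, which is `@` in ASCII, so the font in this PDF may use EBCDIC-style character codes without a usable Unicode mapping. The sketch below tests that guess on the two sample lines. It assumes the extractor rendered each raw byte through PDFDocEncoding and that the code page is `cp500`. `PDFDOC_TO_BYTE` and `reinterpret_as_ebcdic` are hypothetical names, and the table covers only the characters that occur in the sample.

```python
# Characters whose PDFDocEncoding byte differs from their Unicode code point.
# Only the ones needed for the sample are listed (assumption: the garbled
# text is raw font bytes that were rendered through PDFDocEncoding).
PDFDOC_TO_BYTE = {
    "\u2020": 0x81,  # dagger
    "\u2014": 0x84,  # em dash
    "\u2013": 0x85,  # en dash
    "\u0192": 0x86,  # florin
    "\u2044": 0x87,  # fraction slash
    "\u203a": 0x89,  # single right-pointing angle quote
    "\ufb01": 0x93,  # fi ligature
    "\ufb02": 0x94,  # fl ligature
    "\u0141": 0x95,  # L with stroke
    "\u0152": 0x96,  # OE ligature
    "\u017d": 0x99,  # Z with caron
}


def reinterpret_as_ebcdic(garbled, codec="cp500"):
    """Map each character back to a byte, then decode the bytes as EBCDIC."""
    raw = bytearray()
    for ch in garbled:
        code = PDFDOC_TO_BYTE.get(ch, ord(ch))
        if code > 0xFF:
            raise ValueError(f"no byte value known for {ch!r}")
        raw.append(code)
    return raw.decode(codec)


sample = [
    "ÓŒŁ–Ł@`@äŽ–Ł@`@Ä›¥–Ž¢–@¥ŒŒŽ—–ﬁ–Ł",
    "Áﬁ⁄–ﬂ–Ł–@›ŁƒŒŽﬂ†£›–",
]
decoded = [reinterpret_as_ebcdic(line) for line in sample]
for line in decoded:
    print(line)
# Lonen - Uren - Diverse voordelen
# Algemene informatie
```

The function is applied line by line because a line feed is a different byte in EBCDIC. If the guess holds for the whole document, the same call could be run over every line of the extracted text; a PDF with a different font encoding would need a different mapping.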

@KJ, I'm quite new to Python, so could you elaborate a bit further? I'm fairly sure it is not encrypted, since I need no password to open the file. Acrobat Reader does notify me that "at least one signature has a problem", but I don't think that is causing this issue... — Spiffo, Nov 16 '22 at 15:40
Thanks for taking the time. However, I do not understand: if I can open the file in Acrobat (even with a signature issue) without "decrypting" it, is it just the Python package that cannot do the same? Is there a workaround you can think of? — Spiffo, Nov 18 '22 at 12:58
Use [`pypdf`](https://pypi.org/project/pypdf/) instead of PyPDF2/PyPDF3/PyPDF4. I am the maintainer of pypdf and PyPDF2. We improved pypdf a lot in 2022. — Martin Thoma, Dec 26 '22 at 08:40
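
For anyone who wants to try that suggestion, here is a minimal sketch of the question's loop rewritten for `pypdf`. It uses `PdfReader`, `reader.pages`, and `page.extract_text()` from pypdf; the helper names `extract_pdf_text` and `find_pdfs` are made up for illustration, and whether pypdf decodes this particular file correctly is not confirmed anywhere in this thread.

```python
import glob
import os


def extract_pdf_text(pdf_path):
    """Return the text of every page in pdf_path, joined by newlines."""
    # Imported here so the rest of the module loads even if pypdf
    # has not been installed yet (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Fall back to "" for any page where no text is returned.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def find_pdfs(folder):
    """Return the .pdf files directly inside folder, sorted by name."""
    return sorted(glob.glob(os.path.join(folder, "*.pdf")))


if __name__ == "__main__":
    # Path taken from the question; adjust to your own input folder.
    input_dir = r"C:\_Personal\Mega\PycharmProjects\PDFHandler\docs\input"
    for pdf_path in find_pdfs(input_dir):
        print("Processing File: " + os.path.basename(pdf_path))
        print(extract_pdf_text(pdf_path))
```

Iterating over `reader.pages` replaces the manual page counter from the original code, and `os.path.basename` avoids splitting the path on backslashes by hand.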
