I'm using PyPDF4 to read text from a PDF I downloaded. The extraction runs without errors, but the resulting text is not readable:

ÓŒŁ–Ł@`@䎖Ł@`@Ä›¥–Ž¢–@¥ŒŒŽ—–fi–Ł
Áfi⁄–fl–Ł–@›ŁƒŒŽfl†£›–

As far as I know, the file is not encrypted: I can open it in Acrobat Reader without any problem. In Reader I can also select, copy, and paste the text correctly.

For reference, this is the code:

import glob
import os

import PyPDF4

# Folder that contains the PDF files to process
relevant_path = 'C:\\_Personal\\Mega\\PycharmProjects\\PDFHandler\\docs\\input\\'

if __name__ == '__main__':

    # The pattern has no '**', so only this folder is searched
    for pdf_file in glob.iglob(relevant_path + '*.pdf'):

        print('Processing File: ' + os.path.basename(pdf_file))
        pdf_reader = PyPDF4.PdfFileReader(pdf_file)
        num_pages = pdf_reader.numPages

        print(num_pages)

        # Concatenate the extracted text of every page
        text = ''
        for page_number in range(num_pages):
            page = pdf_reader.getPage(page_number)
            text += page.extractText()

        print(text)

Any hints? Are there other packages I could use?

Spiffo
  • Does the entire document look like this, or only part of the text? – Bhargav - Retarded Skills Nov 16 '22 at 13:09
  • The entire output looks like this. – Spiffo Nov 16 '22 at 13:14
  • The PDF file is my payslip, so I'm not eager to share it online :) – Spiffo Nov 16 '22 at 14:00
  • @KJ, I'm quite new to Python, so could you elaborate a bit further? I'm quite sure it is not encrypted, since I need no password to open the file. I do get a notification in Acrobat Reader that "at least one signature has a problem", but I don't think this is causing the issue... – Spiffo Nov 16 '22 at 15:40
  • Thanks for taking the time. However, I do not understand: if I can open the file in Acrobat (even with a signature issue) without "decrypting" it, is it the Python package that cannot do the same? Is there a workaround you can think of? – Spiffo Nov 18 '22 at 12:58
  • Use [`pypdf`](https://pypi.org/project/pypdf/) instead of PyPDF2/PyPDF3/PyPDF4. I am the maintainer of pypdf and PyPDF2. We improved pypdf a lot in 2022. – Martin Thoma Dec 26 '22 at 08:40
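
The switch suggested in the last comment might look roughly like the sketch below. It assumes `pypdf` is installed (`pip install pypdf`) and uses its `PdfReader`, `pages`, and `extract_text()` API; the helper names `join_page_text` and `read_pdf_text` and the `docs/input` folder are made up for illustration. Whether `pypdf` decodes this particular payslip correctly has not been tested here.

```python
import glob
import os


def join_page_text(pages) -> str:
    """Concatenate the text of all pages, one page per line block.

    `pages` can be any iterable of objects with an extract_text() method,
    such as the `pages` attribute of a pypdf PdfReader.
    """
    return "\n".join(page.extract_text() or "" for page in pages)


def read_pdf_text(path: str) -> str:
    """Extract the text of one PDF with pypdf instead of PyPDF4."""
    # Imported lazily so this module still loads where pypdf is missing.
    from pypdf import PdfReader  # third-party package: pip install pypdf

    reader = PdfReader(path)
    return join_page_text(reader.pages)


if __name__ == "__main__":
    # Hypothetical folder; replace it with the directory holding the PDFs.
    input_dir = "docs/input"
    for pdf_path in glob.glob(os.path.join(input_dir, "*.pdf")):
        print("Processing File:", os.path.basename(pdf_path))
        print(read_pdf_text(pdf_path))
```

Unlike the loop in the question, this joins pages with a newline, which tends to keep the last word of one page from running into the first word of the next.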

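
A possible workaround, offered as a guess based only on the sample output in the question. The `@` characters sit where spaces would be expected, and byte 0x40 is the space character in EBCDIC. The other odd characters (`Ł`, `Œ`, `Ž`, `›`, the dashes) are the ones PDFDocEncoding assigns to bytes 0x80 to 0x9E, which is where EBCDIC keeps its lowercase letters. That pattern is consistent with EBCDIC-encoded text being shown byte for byte. The sketch below reverses the mapping. The code page `cp037` is an assumption (`cp500` and `cp1140` are close relatives), and `repair_ebcdic_text` and `CHAR_TO_BYTE` are illustrative names, not part of any PDF library.

```python
# Characters that PDFDocEncoding assigns to bytes 0x80..0x9E, in byte order.
_PDFDOC_SPECIALS = (
    "\u2022\u2020\u2021\u2026\u2014\u2013\u0192\u2044"  # 0x80-0x87
    "\u2039\u203a\u2212\u2030\u201e\u201c\u201d\u2018"  # 0x88-0x8F
    "\u2019\u201a\u2122\ufb01\ufb02\u0141\u0152\u0160"  # 0x90-0x97
    "\u0178\u017d\u0131\u0142\u0153\u0161\u017e"        # 0x98-0x9E
)
CHAR_TO_BYTE = {ch: 0x80 + i for i, ch in enumerate(_PDFDOC_SPECIALS)}
CHAR_TO_BYTE["\u20ac"] = 0xA0  # euro sign

EBCDIC_QUESTION_MARK = 0x6F  # '?' in cp037, used for unmappable characters


def _repair_line(line: str, codepage: str) -> str:
    """Map each character back to its byte, then decode the bytes as EBCDIC."""
    raw = bytearray()
    for ch in line:
        if ch in CHAR_TO_BYTE:
            raw.append(CHAR_TO_BYTE[ch])
        elif ord(ch) < 0x100:
            raw.append(ord(ch))  # Latin-1 range: code point equals byte value
        else:
            raw.append(EBCDIC_QUESTION_MARK)
    return raw.decode(codepage)


def repair_ebcdic_text(garbled: str, codepage: str = "cp037") -> str:
    """Repair text line by line so that line breaks in the input survive."""
    return "\n".join(_repair_line(line, codepage) for line in garbled.split("\n"))


if __name__ == "__main__":
    # Start of the first sample line from the question, written with escapes.
    print(repair_ebcdic_text("\u00d3\u0152\u0141\u2013\u0141@`@"))  # Lonen - 
    # Start of the second sample line.
    print(repair_ebcdic_text("\u00c1\ufb01\u2044\u2013\ufb02\u2013\u0141\u2013"))  # Algemene
```

If the `fi` and `fl` ligatures were flattened into two plain letters when the output was copied, they will not map back correctly, so it is safer to apply the function to the string returned by `extractText()` than to pasted text. Characters outside the table come out as `?`.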