extracting text from pdf - PyPDF2

Question

I am following the tutorial on this page to extract text from a PDF:

http://www.blog.pythonlibrary.org/2018/06/07/an-intro-to-pypdf2/

I can print the PDF's document information, but I cannot print the content of the pages. The code doesn't throw any error, but it doesn't show the text of the PDF either.

What could be the problem?

from PyPDF2 import PdfFileReader


def get_info(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    #print(info)

    author = info.author
    creator = info.creator
    producer = info.producer
    subject = info.subject
    title = info.title


    print(author)
    print(creator)
    print(producer)
    print(subject)
    print(title)

def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)

        # get the first page
        page = pdf.getPage(0)
        print(page)
        print('Page type: {}'.format(str(type(page))))

        text = page.extractText()

        print(text)  # this should print the text from the PDF, but it doesn't



if __name__ == '__main__':
    # PDF URL: https://oficinavirtual.ugr.es/apli/solicitudPAU/test.pdf
    path = 'test.pdf'
    get_info(path)
    print("\n"*2)
    text_extractor(path)

Your code works fine for me, so the problem probably comes from the file you are using (maybe the first page is empty?). Also, I have been using PyPDF2 for a while and it is not foolproof: PDFs encoded with old versions of Adobe software, or converted from unusual formats, may fail to parse, throw exceptions, or just return gibberish. So test your code with different files, and use a try/except. — SivolcC, Sep 11 '19 at 00:04
Does this answer your question? [Extract text from pdf converted from webpage using Pypdf2](https://stackoverflow.com/questions/60669890/extract-text-from-pdf-converted-from-webpage-using-pypdf2) — Ankit Veer Singh, May 18 '20 at 15:26
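
The advice in the first comment (try several files and wrap the extraction in a try/except) could be sketched roughly as follows. This is only an illustration, not a fix: `extract_first_page` and `check_files` are hypothetical helper names, and the sketch assumes a PyPDF2 version that still provides the `PdfFileReader`, `getPage` and `extractText` calls used in the question.

```python
def extract_first_page(path):
    """Return the text of the first page, using the same PyPDF2 calls as the question."""
    # Imported here so check_files() can also be used with another extractor
    # when PyPDF2 is not installed.
    from PyPDF2 import PdfFileReader

    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        return pdf.getPage(0).extractText()


def check_files(paths, extract=extract_first_page):
    """Map each path to ('ok', text), ('empty', '') or ('error', message)."""
    results = {}
    for path in paths:
        try:
            text = extract(path)
        except Exception as exc:  # deliberately broad: PDF parsers can fail in many ways
            results[path] = ('error', '{}: {}'.format(type(exc).__name__, exc))
            continue
        # Treat None, '' and whitespace-only output as "nothing extracted".
        if text and text.strip():
            results[path] = ('ok', text)
        else:
            results[path] = ('empty', '')
    return results
```

Calling `check_files(['test.pdf', 'other.pdf'])` (where `other.pdf` stands for any second file you have at hand) and printing the result shows, per file, whether extraction worked, returned nothing, or raised. An `'empty'` result for `test.pdf` alongside an `'ok'` result for another file would suggest the problem lies in the file rather than in the code.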

score 0 · Answer 1 · answered Nov 16 '19 at 09:39


Although this is not a solution to the PyPDF2 problem itself, you can simply install pdfminer3 with pip and use the minimal reproducible example here


A.Ametov