How to convert the binary text generated in my .PDF to a string?

Question

I am using this code:

from PyPDF2 import PdfFileReader

def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)

        # get the first page
        page = pdf.getPage(0)
        print(page)
        print('Page type: {}'.format(str(type(page))))

        text = page.extractText()
        print(text)


if __name__ == '__main__':
    path = 'XEROX.pdf'
    text_extractor(path)

But this returns:

{'/Type': '/Page', '/MediaBox': [0, 0, 612, 792], '/Parent': IndirectObject(3, 0),
 '/Resources': {'/ProcSet': ['/PDF', '/ImageB', '/Text'],
 '/ExtGState': IndirectObject(47, 0), '/Font': IndirectObject(48, 0)},
 '/Contents': IndirectObject(5, 0)}
Page type: <class 'PyPDF2.pdf.PageObject'>
 !ˆ"#$
[Finished in 0.9s]

Where is the data?

I think this PDF contains binary symbols instead of ASCII. How can I read this information as ASCII or as a string?

This is the result when I copy and paste the information from the PDF:

score 0 · Answer 1 · answered Nov 10 '18 at 20:44

I found it:

I cloned the textract repository from GitHub. I installed textract (with some problems, but I got it working) and it works very well. I will edit this answer to include my code.
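
Until that code is posted, here is a minimal sketch of how textract is typically called. It is not the answerer's actual code. It assumes textract is installed together with the backend it uses for PDFs, and the helper names `decode_extracted` and `pdf_to_text` are made up for illustration. `textract.process` returns bytes, so the result has to be decoded to get a string.

```python
def decode_extracted(raw, encoding="utf-8"):
    """Turn the bytes returned by an extractor into a clean str.

    Bytes that are not valid in the given encoding are replaced with
    U+FFFD instead of raising an error.
    """
    return raw.decode(encoding, errors="replace").strip()


def pdf_to_text(path, method="pdftotext", encoding="utf-8"):
    """Extract the text of a PDF with textract and return it as a str.

    Assumption: textract and the chosen backend are installed.
    'pdftotext' is textract's default for PDFs; 'pdfminer' and
    'tesseract' (OCR, for scanned pages) are the alternatives.
    """
    import textract  # third-party; imported here so the helper above works without it

    raw = textract.process(path, method=method, encoding=encoding)  # returns bytes
    return decode_extracted(raw, encoding)


# Usage, with the file name from the question:
#     print(pdf_to_text('XEROX.pdf'))
# If the result is still empty or garbled, the pages may be scanned images;
# in that case method='tesseract' may be worth trying.
```

Whether this recovers readable text depends on how the PDF was produced, so treat the choice of `method` as something to experiment with.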

Regards

Toni
